# Imports

In [10]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

In [13]:
'''
from sklearn.svm import SVC
SVC stands for Support Vector Classification, and it's a machine learning model provided by the scikit-learn (sklearn) library.
Support Vector Machine (SVM) is a supervised learning algorithm used for classification tasks (and regression tasks, through SVR).
How it works: SVM finds the hyperplane (or decision boundary) that best separates the data into different classes. The points closest to the hyperplane (called support vectors) are critical to the classifier's accuracy.
Common Usage: SVM is commonly used for binary and multi-class classification tasks, such as text classification, image recognition, and more.
from sklearn.model_selection import train_test_split
train_test_split is a function from scikit-learn that splits a dataset into two subsets: one for training the model and one for testing the model. This is essential to evaluate how well your model generalizes to unseen data.
Why is it important?: Splitting data into training and testing sets helps you evaluate the performance of your machine learning model and reduces the risk of overfitting (when a model is too closely fit to the training data and fails to generalize to new data).
Common Usage: You typically split data into training and testing sets in an 80/20 or 70/30 ratio.
CountVectorizer is a text feature extraction tool in scikit-learn. It converts a collection of text documents into a matrix of token counts, i.e., it transforms text into numerical features.
How it works: CountVectorizer creates a bag-of-words model. It treats each unique word as a feature and counts the number of occurrences of each word in each document.
Common Usage: This is typically used in text classification problems (e.g., spam detection, sentiment analysis), where you want to represent text as a set of features for a machine learning model.
LabelEncoder is used for encoding categorical labels into numerical values. This is particularly useful in machine learning when you need to convert categorical target labels (e.g., 'spam', 'ham' in email classification) into numerical format, which is required by most machine learning algorithms.
How it works: It assigns a unique integer to each label (category) in the target variable, numbering the classes in sorted order. For example, if you have a binary classification task with labels 'spam' and 'ham', LabelEncoder would convert 'ham' to 0 and 'spam' to 1.
Common Usage: Typically used for encoding target labels in classification tasks, but can also be used for encoding categorical features if needed.
'''
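
The label numbering described above can be checked on a few toy labels. This is a minimal sketch; the `labels` list is made up for illustration and does not come from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not taken from spam2.csv).
labels = ["spam", "ham", "ham", "spam"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels in sorted order;
# each label's position in classes_ is its integer code.
print(list(encoder.classes_))
print(encoded.tolist())
```

Because the codes are positions in `classes_`, indexing `classes_` with a predicted integer recovers the text label.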

In [14]:
'''
Data Loading:

You would typically use pandas to load a dataset (e.g., a CSV file) into a DataFrame and perform any necessary data preprocessing (cleaning, handling missing values, etc.).
Feature Extraction (Text Data):

If your dataset consists of text (like email content or reviews), you would use CountVectorizer to convert the text into numerical features, typically in the form of a document-term matrix (each document is represented as a vector of word counts).
Label Encoding:

If your target labels are categorical (e.g., 'spam', 'ham'), you would use LabelEncoder to convert them into numeric values that can be used by machine learning algorithms.
Data Splitting:

You would split the dataset into training and test sets using train_test_split to ensure that the model is evaluated on unseen data, helping you assess its performance and generalizability.
Modeling:

With the training data, you would use SVC (Support Vector Classification) to build a classifier. The model is trained using the features (from CountVectorizer) and the corresponding labels (after encoding).
Evaluation:

Once the model is trained, you would use the test data to evaluate how well it performs by comparing the predicted labels (y_pred) with the actual test labels (y_test).
'''
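
The six steps above can be sketched end to end on a tiny, made-up corpus. Everything in `toy` is hypothetical and stands in for the real CSV. The linear kernel is chosen here only to keep the toy example simple, and the accuracy on two test messages says nothing about real performance.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Data loading: a hypothetical in-memory frame instead of pd.read_csv(...).
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam"],
    "v2": [
        "see you at lunch",
        "win a free prize now",
        "are you home yet",
        "free entry claim your prize",
        "call me when you can",
        "claim your free cash now",
        "lunch at noon works",
        "win cash prize today",
    ],
})

# Feature extraction and label encoding.
features = CountVectorizer().fit_transform(toy["v2"])
targets = LabelEncoder().fit_transform(toy["v1"])

# Data splitting (stratified so both classes appear in each split).
X_tr, X_te, y_tr, y_te = train_test_split(
    features, targets, test_size=0.25, stratify=targets, random_state=0
)

# Modeling and evaluation.
model = SVC(kernel="linear").fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print("accuracy on the toy test split:", (y_pred == y_te).mean())
```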

# Data importing

In [15]:
data = pd.read_csv(r"C:\Users\samay\OneDrive\Desktop\Comp_lab_2\spam2.csv",encoding='utf-8',usecols=['v1','v2'])

data.head()

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


# Checking for missing values and dropping them

In [16]:
data.isnull().sum()

v1    0
v2    1
dtype: int64

In [17]:
data.dropna(inplace=True)
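
The check-then-drop pattern can be seen on a small stand-in frame. This is a sketch on made-up rows, not the real data. The `subset` argument and the index reset are optional additions that the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing message.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["hello there", np.nan, "see you soon"],
})

missing = toy.isnull().sum()  # missing-value count per column
# Drop only rows whose message text is missing, then renumber the index.
cleaned = toy.dropna(subset=["v2"]).reset_index(drop=True)

print(missing)
print(cleaned)
```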

# Converting String to Vector

In [18]:
cv = CountVectorizer()
labelenc = LabelEncoder()

x = cv.fit_transform(data['v2'])
y = labelenc.fit_transform(data['v1'])

In [19]:
'''
1. cv = CountVectorizer()
CountVectorizer is a feature extraction tool from the scikit-learn library. It is used to convert a collection of text documents into a matrix of token counts, which is commonly known as the Bag-of-Words (BoW) model.
What it does: It transforms a corpus of text (a collection of documents) into a document-term matrix (DTM), where each row represents a document, and each column represents a unique word (or term) from the entire corpus. The value in each cell of the matrix represents the count of the word in that document.
Example: If you have the following two sentences:

"I love programming"
"Programming is fun"

After applying CountVectorizer with its default settings, the text is lowercased, the columns follow the alphabetically sorted vocabulary, and "I" is dropped because the default token pattern only keeps tokens of two or more characters. You would get something like:

|     | fun | is  | love | programming |
| --- | --- | --- | ---- | ----------- |
| 1   | 0   | 0   | 1    | 1           |
| 2   | 1   | 1   | 0    | 1           |

In this case:

cv = CountVectorizer() initializes the CountVectorizer object. This will later be used to fit the text data and transform it into the document-term matrix.

fit_transform() method is used to both:

Fit the vectorizer to the data (i.e., learn the vocabulary from the corpus of text).
Transform the text data into a matrix of token counts.
x = cv.fit_transform(data['v2'])
data['v2']: This presumably refers to a column in your data DataFrame that contains text documents. So, cv.fit_transform(data['v2']) will:
Learn the vocabulary from all the text in column v2.
Convert the text into a sparse matrix where each row represents a document, and each column represents the count of a specific word in that document.
Result: x will be a sparse matrix (or document-term matrix) with the token counts for each word across all documents in the data['v2'] column.

2. labelenc = LabelEncoder()
LabelEncoder is a utility class from the scikit-learn library used for encoding categorical labels into numerical values. It is most commonly used when you have a target variable that consists of labels in text form (e.g., 'spam', 'ham', 'positive', 'negative') and you need to convert those labels into numbers for machine learning algorithms, which usually require numerical inputs.
Example: If you have a target variable with labels like:

['spam', 'ham', 'ham', 'spam', 'spam']

The LabelEncoder will convert this into:

[1, 0, 0, 1, 1]

In your code:

y = labelenc.fit_transform(data['v1'])
data['v1']: This refers to a column in your data DataFrame that contains the labels (e.g., 'spam', 'ham', 'positive', 'negative').
fit_transform():
fit(): Learns the unique categories (or classes) in the column.
transform(): Converts each category (or class) into a corresponding integer label.
Result: y will be a numpy array containing the numerical labels corresponding to each label in the data['v1'] column.

For example, if data['v1'] contained:

['spam', 'ham', 'ham', 'spam', 'spam']

After applying LabelEncoder(), y will be:

[1, 0, 0, 1, 1]

LabelEncoder numbers the classes in sorted order, so 'ham' is encoded as 0 and 'spam' as 1.

Putting it Together:
x = cv.fit_transform(data['v2']):

This line processes the text in the data['v2'] column (which likely contains text documents) and transforms it into a sparse matrix of token counts (document-term matrix).
x will be the feature set for the model, where each row corresponds to a document, and each column represents a word from the vocabulary, with the value representing how many times that word appears in the document.
y = labelenc.fit_transform(data['v1']):

This line processes the labels in the data['v1'] column (which might contain categories like 'spam', 'ham', etc.) and converts them into numerical format so they can be used in machine learning algorithms.
y will be the target labels (numerically encoded) that the model will learn to predict.
Example Scenario:
Let's assume the dataset is related to spam email classification, where:

data['v2'] contains the email content.
data['v1'] contains the labels (e.g., 'spam' or 'ham').
cv.fit_transform(data['v2']):

The CountVectorizer learns the vocabulary of the emails in data['v2'] and converts the email content into numerical features (the frequency of each word in each email).
labelenc.fit_transform(data['v1']):

The LabelEncoder converts the labels (like 'spam' and 'ham') into numbers, such as 1 for 'spam' and 0 for 'ham'.
After these transformations, the model can use x (the features from the text) and y (the corresponding labels) for training a machine learning model, such as a Support Vector Machine (SVM), Naive Bayes, or any other classifier.

Final Workflow:
x contains the transformed text data (as a sparse matrix of word counts).
y contains the numerical labels (encoded as 0s and 1s).
This is typically followed by training a classifier (e.g., SVM, logistic regression) to predict the label (y) from the feature set (x).

'''




# Splitting data into train and test sets

In [20]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.25,random_state=42)
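
Spam datasets are often imbalanced, so it can help to keep the class ratio the same in both splits. The sketch below uses the `stratify` argument of `train_test_split` on made-up arrays (`X_demo`, `y_demo`). The notebook's own split does not stratify, so this is an optional variation rather than what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0, 20 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(len(y_tr), len(y_te))      # sizes of the two splits
print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```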

# Training a model

In [21]:
svc = SVC()
svc.fit(X_train,y_train)

In [22]:
'''
1. svc = SVC()
SVC stands for Support Vector Classification. It is a type of Support Vector Machine (SVM) algorithm provided by scikit-learn (a popular machine learning library in Python).
Support Vector Machine (SVM) is a supervised learning algorithm that is widely used for classification and regression tasks. It works by finding the hyperplane that best separates data points of different classes in a higher-dimensional space.

Support Vector Classification specifically refers to the classification task, where the goal is to separate data points belonging to different classes (e.g., spam vs. non-spam emails, cats vs. dogs in images) using a hyperplane.

When you initialize SVC(), you are creating an instance of the Support Vector Classifier. The SVC() class has several hyperparameters (like the kernel, C, gamma, etc.) that you can customize, but if you don't specify any, it will use the default settings.

For example, the default kernel is 'rbf' (Radial Basis Function), which is commonly used for classification tasks because it can model non-linear decision boundaries.
In this code, we are creating an SVC object called svc that will be used for classification. By default, this will use an RBF kernel.

2. svc.fit(X_train, y_train)
fit() is a method in scikit-learn that trains a machine learning model using the provided data. In this case, the model being trained is the SVC classifier.

X_train: This is the training data (the features) that the model will learn from. In the context of text classification, X_train could be a document-term matrix (or a set of extracted features from text), which represents the text data numerically.

Each row in X_train corresponds to a document (an individual data point), and each column corresponds to a specific feature (e.g., the frequency of a particular word, n-gram, or token).
y_train: These are the labels or target values corresponding to the training data. These represent the actual categories or classes that you want the model to predict (e.g., 'spam' or 'ham', 'cat' or 'dog').

How fit() works: The fit() method is where the model is actually trained. Here's what happens step-by-step:

The SVC algorithm uses the training data (X_train) and the corresponding labels (y_train) to find the optimal hyperplane (or decision boundary) that best separates the different classes in the feature space.

SVM tries to maximize the margin, i.e., the distance between the decision boundary and the closest training points of each class. Those closest points are the support vectors. A wider margin tends to help the model generalize better to unseen data.

The kernel function (by default, RBF) implicitly maps the data into a higher-dimensional space, where a linear separating hyperplane can be found even when the classes are not linearly separable in their original form.

The hyperparameters like C (regularization) and gamma (influence of a single training example) control the tradeoff between fitting the training data well and keeping the model simple to avoid overfitting.

In Summary:

X_train is the set of features (data) used to train the model.
y_train is the set of labels (or classes) associated with the data.
fit() is the method that trains the SVM model by learning from the data and adjusting its internal parameters (such as the support vectors) to best separate the classes.
3. Example of how SVC works:
Here’s a simplified example of how this might work in practice:

Suppose you are building a model to classify emails as either 'spam' or 'ham' (non-spam).

Preprocessing: First, you would process the raw email text into a numerical representation, such as a document-term matrix using CountVectorizer or TfidfVectorizer.

Feature and Label Extraction:

X_train will contain the features extracted from the email text.
y_train will contain the labels (0 for 'ham', 1 for 'spam').
Training the Model:

When you call svc.fit(X_train, y_train), the SVM classifier will analyze the training data and attempt to find the hyperplane (or decision boundary) that best separates the 'spam' and 'ham' emails based on the features in X_train.
Decision Boundary:

After training, the model will be ready to make predictions on new, unseen data by determining which side of the decision boundary the new data points fall on (i.e., which class they belong to).
4. Model Prediction:
Once the model has been trained using fit(), you can make predictions on new, unseen data with the predict() method:

y_pred = svc.predict(X_test)  # Predict the labels for the test data
Here, X_test represents the test data (features) that the model has never seen before. The predict() method will classify the test data based on the hyperplane the model learned during training.

Summary of the Workflow:
Initialize SVC: You create an instance of the SVC classifier.

svc = SVC()  # or svc = SVC(kernel='linear') for a linear kernel
Train the Model: You use fit() to train the model on the training data (X_train) and the corresponding labels (y_train).

svc.fit(X_train, y_train)
Model Trained: The SVM algorithm now has an internal model with a decision boundary (hyperplane) that can separate the data into different classes. This model can then be used to make predictions on new data.

Key Points:
SVC() creates an SVM classifier.
.fit(X_train, y_train) trains the classifier using the training data and labels.
The SVM classifier learns the decision boundary that best separates the different classes in the feature space.
'''
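
The ideas of a decision boundary and support vectors are easier to see on a tiny two-dimensional problem than on a sparse text matrix. The sketch below uses four made-up points (`X_demo`, `y_demo`) and a linear kernel so the boundary is a straight line. It illustrates the concepts only and is not the model trained in this notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class along a diagonal.
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_demo = np.array([0, 0, 1, 1])

# With no arguments, SVC uses the RBF kernel.
print(SVC().get_params()["kernel"])

# A linear kernel keeps the decision boundary a straight line.
linear_svc = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)
print(linear_svc.support_vectors_)  # training points that define the margin

new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
predictions = linear_svc.predict(new_points)
scores = linear_svc.decision_function(new_points)
print(predictions)  # predicted class per point
print(scores)       # signed score: the sign tells which side of the boundary
```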

In [23]:
svc.score(X_test,y_test)

0.9763101220387652

In [24]:
'''
The score() method computes the accuracy of the trained model on the given test data (X_test) and the true labels (y_test).
Accuracy is defined as the proportion of correct predictions made by the model on the test set, compared to the total number of predictions.
'''
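
The accuracy definition above can be reproduced by hand on a few made-up labels. The arrays `y_true` and `y_hat` below are hypothetical, not outputs of the trained model. A confusion matrix is included because accuracy alone can hide which class the errors fall on, and that matters when one class is rare.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (0 = ham, 1 = spam).
y_true = np.array([0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 1, 1, 1])

manual_accuracy = (y_true == y_hat).mean()       # correct predictions / total
library_accuracy = accuracy_score(y_true, y_hat)
matrix = confusion_matrix(y_true, y_hat)         # rows: true class, columns: predicted class

print(manual_accuracy, library_accuracy)
print(matrix)
```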


# Testing

In [25]:
email = cv.transform(['Congratulations!!!, You won Lottery of $1500000000 just now, just click on following link https://lottery.com/claim to claim your prize money'])
labelenc.classes_[svc.predict(email)[0]]

'spam'

In [26]:
'''
email = cv.transform([...]):

cv.transform([...]): The CountVectorizer (cv) is used to transform the input email text into a numerical feature vector (using the vocabulary learned from the training data).
Here, the email text is passed as a list of strings, and the transform method converts the text into a document-term matrix (DTM).
Result: email is now a sparse matrix containing the word counts of the email, which can be used by the SVM model.
svc.predict(email):

svc.predict(...): The predict() method is used to make a prediction for the transformed input data (email).
The model (svc) will classify the email based on its learned patterns and return a predicted label (e.g., 0 or 1, depending on the classes it was trained on).
Result: The output is an array containing the predicted class label for the email. svc.predict(email)[0] selects the first element of the array, which is the predicted label.
labelenc.classes_[...]:

labelenc.classes_: This is an array of all unique class labels (i.e., the original text labels, such as 'spam' and 'ham') that were encoded into numerical values during training by the LabelEncoder.
The predicted label (e.g., 0 or 1) is used to index into this array, to map the numerical label back to its original class label.
Result: It retrieves the human-readable label (e.g., 'spam' or 'ham') based on the predicted number.
What happens overall:
The email text is transformed into a numerical vector using cv.transform().
The model (svc) predicts the label (e.g., 0 or 1) for the email.
The predicted label is converted back to its original class name (e.g., 'spam' or 'ham') using labelenc.classes_.
Example:
Assuming the model has been trained to classify emails into 'spam' (encoded as 1) and 'ham' (encoded as 0), the code will output:

labelenc.classes_[1] → 'spam' (if the email is predicted as spam).
labelenc.classes_[0] → 'ham' (if the email is predicted as non-spam).
'''
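
Two details of this prediction step can be sketched in isolation. First, `transform()` only counts words that were in the vocabulary learned during `fit`, so entirely new words contribute nothing. Second, `inverse_transform()` is an alternative to indexing `classes_` by hand. The strings and names below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-message corpus used only to learn a small vocabulary.
vec = CountVectorizer().fit(["free prize now", "see you at lunch"])

# None of these words were seen during fit, so the row has no non-zero counts.
unseen = vec.transform(["totally unfamiliar wording"])
print(unseen.nnz)

# Map integer predictions back to text labels without indexing classes_.
enc = LabelEncoder().fit(["ham", "spam"])
decoded = enc.inverse_transform([1, 0])
print(decoded)
```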