<a href="https://colab.research.google.com/github/GioGio2004/ML-documentation/blob/main/machine_learining_docs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Gathering information from online resources about machine learning**

The code examples in this documentation are gathered from **Gemini** and other sources.

**coursera.org**

Top machine learning algorithms to know
From classification to regression, here are seven algorithms you need to know:

1. **Linear regression**:
Linear regression is a supervised learning algorithm used to predict and forecast values within a continuous range, such as sales numbers or prices.

Originating from statistics, linear regression performs a regression task: it fits a line with a constant slope that relates an input value (X) to a variable output (Y), and uses that line to predict a numeric value or quantity.

Linear regression uses labelled data to make predictions by establishing a line of best fit, or 'regression line', that is approximated from a scatter plot of data points. As a result, linear regression is used for predictive modelling rather than categorisation.


Linear Regression in Machine Learning

Linear regression is a fundamental supervised learning algorithm widely used for continuous variable prediction. It establishes a linear relationship between one or more independent variables (also known as features or predictors) and a dependent variable (also called target or outcome). The model learns the coefficients that best fit a straight line through the data points, enabling predictions for new unseen data.

Key Concepts

Supervised Learning: The model learns from labeled data where each data point has an associated target value.
Linear Relationship: The algorithm assumes a linear association between the features and the target variable. Non-linear relationships require more advanced techniques.
Model Fitting: The process of finding the coefficients (slope and intercept) that minimize the difference between the predicted values and the actual target values. This is often achieved using the ordinary least squares (OLS) method.
Prediction: Once the model is trained, you can use the equation with new feature values to predict the corresponding target value.
Code Examples

Python (Scikit-learn):

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (replace with your actual data)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
new_data = np.array([[6]])  # Example: Predict for a new feature value of 6
predicted_value = model.predict(new_data)
print(predicted_value)  # Output: [5.8] (up to floating-point rounding; OLS is deterministic)

# Get model coefficients (slope and intercept)
slope = model.coef_[0]
intercept = model.intercept_
print(f"Slope (m): {slope}, Intercept (b): {intercept}")


[5.8]
Slope (m): 0.6, Intercept (b): 2.2


Explanation:

Import libraries: numpy for numerical computations and LinearRegression from scikit-learn for the linear regression model.
Sample data: Replace with your actual data in a NumPy array format, where X is the feature matrix and y is the target vector.
Model creation: Instantiate a LinearRegression object.
Model training: Call model.fit(X, y) to train the model on the provided data.
Prediction: Use model.predict(new_data) to predict the target value for new feature values in new_data.
Model coefficients: Access the model's slope (model.coef_[0]) and intercept (model.intercept_) to understand the linear relationship learned.
Additional Notes

Linear regression is a versatile tool for tasks like:
Forecasting (e.g., predicting sales based on historical data)
Trend analysis (e.g., identifying linear trends in stock prices)
Correlation exploration (e.g., understanding the relationship between house size and price)
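Linear regression only gives useful results when the relationship between the features and the target variable is approximately linear. Data visualization (for example, a scatter plot or a residual plot) and domain knowledge can help assess this.
Feature scaling does not change the predictions of plain OLS, but it makes coefficients comparable across features and matters for regularized variants (Ridge, Lasso) and gradient-based solvers, especially when features have different units or scales.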
Linear regression may not be suitable for highly non-linear relationships. Explore other learning algorithms like decision trees, support vector machines, or neural networks for such scenarios.
By understanding these concepts and using code examples like the one provided, you can effectively apply linear regression for continuous variable prediction in your machine learning projects.

**additional:**

**Ordinary Least Squares (OLS)**

OLS is a statistical method at the heart of linear regression. It aims to find the line (or hyperplane in higher dimensions) that best fits a set of data points by minimizing the sum of squared errors (SSE). In other words, OLS searches for the coefficients (slope and intercept) that make the predicted values from the regression line as close as possible to the actual target values.

Explanation:

Model Equation: The equation for a linear regression model is:


In [None]:
# y = b0 + b1 * x + e

where:

y is the dependent variable (target value)
x is the independent variable (feature)
b0 is the intercept (the value of y where the line crosses the y-axis, i.e. at x = 0)
b1 is the slope (coefficient indicating the change in y for a unit change in x)
e is the error term (the difference between the actual value of y and the predicted value from the model)
Minimizing SSE: OLS focuses on minimizing the sum of squared errors (SSE), which is the sum of the squared distances between the predicted and actual values of y:

In [None]:
# SSE = sum((y_i - (b0 + b1 * x_i)) ** 2)

where y_i and x_i represent individual data points and the summation goes over all data points.

Calculus and Solution: By taking the partial derivative of SSE with respect to b0 and b1 and setting them to zero, we can find the values of b0 and b1 that minimize SSE. This leads to a system of equations that can be solved to obtain the optimal estimates for the intercept and slope.
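
The solution of that system can be written in closed form for the single-feature case. The following is a minimal sketch that reuses the sample data from the scikit-learn example in this document; the variable names (x_mean, y_mean, sse) are chosen here for illustration only.

```python
import numpy as np

# Same sample data as the scikit-learn example in this document
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = x.mean(), y.mean()

# Setting dSSE/db1 = 0 and dSSE/db0 = 0 and solving gives:
#   b1 = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean) ** 2)
#   b0 = y_mean - b1 * x_mean
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Sum of squared errors at the optimum
sse = np.sum((y - (b0 + b1 * x)) ** 2)

print(f"b1 (slope): {b1:.4f}, b0 (intercept): {b0:.4f}, SSE: {sse:.4f}")
```

On this data the closed form should reproduce the slope of 0.6 and the intercept of 2.2 reported by LinearRegression.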

Code Example (Python with Scikit-learn):

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (replace with your actual data)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Model creation (OLS is built into LinearRegression)
model = LinearRegression()
model.fit(X, y)

# Get model coefficients (slope and intercept)
slope = model.coef_[0]
intercept = model.intercept_
print(f"OLS Slope (m): {slope}, Intercept (b): {intercept}")


Explanation:

We import numpy for numerical computations and LinearRegression from scikit-learn.
We create sample data in NumPy arrays for features (X) and target (y).
We instantiate a LinearRegression object (which uses OLS internally).
The model.fit(X, y) step trains the model, finding the optimal slope and intercept using OLS.
Finally, we access the learned coefficients using model.coef_[0] (slope) and model.intercept_ (intercept).
While Scikit-learn provides a convenient way to use OLS, understanding the underlying principle helps interpret the results and choose appropriate regression methods for your specific problems.

LinearRegression in scikit-learn

LinearRegression is a versatile class within scikit-learn's linear_model module for fitting and performing linear regression. It implements the Ordinary Least Squares (OLS) method to establish a linear relationship between one or more independent variables (features) and a continuous dependent variable (target).

Key Attributes and Methods:

fit(X, y): This method is the core of the model. It trains the model by fitting a linear model to the provided data (X is the feature matrix, y is the target vector). Scikit-learn employs OLS internally to find the optimal coefficients (slope and intercept) that minimize the sum of squared errors (SSE) between predicted and actual target values.
coef_: This read-only attribute stores the estimated coefficients for the linear regression problem. For single-target regression, it's a 1D array of length n_features, representing the coefficients for each feature. For multi-target regression (passing a 2D y), it's a 2D array of shape (n_targets, n_features).
intercept_: This read-only attribute holds the intercept (y-axis value where the regression line crosses) learned during model fitting.
predict(X): This method enables you to predict target values for new unseen data (X) based on the fitted model. It returns a 1D array of predicted target values for each input sample in X.
score(X, y, sample_weight=None): This method calculates the coefficient of determination (R-squared), a goodness-of-fit metric that indicates the proportion of variance in the target variable explained by the linear model. Higher R-squared values (closer to 1) suggest better model fit. The optional sample_weight argument allows weighted R-squared calculation.
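
As a rough illustration of what score returns, the sketch below computes R-squared by hand as 1 - SS_res / SS_tot and compares it with model.score. It reuses the sample data from earlier in this document; the names ss_res, ss_tot and r2_manual are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_from_score = model.score(X, y)

print(f"coef_ shape: {model.coef_.shape}, intercept_: {model.intercept_:.4f}")
print(f"R^2 (manual): {r2_manual:.4f}, R^2 (score): {r2_from_score:.4f}")
```

Note that this scores the model on its own training data, which tends to be optimistic; scoring on held-out data gives a fairer estimate.
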
Additional Considerations:

Linearity Assumption: Linear regression assumes a linear relationship between features and the target variable. If the data exhibits non-linearity, consider alternative models like decision trees, support vector machines, or neural networks.
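Feature Scaling: Scaling (e.g., standardization or normalization) does not change the predictions of plain OLS, but it is often recommended when features have different scales or units, because it makes coefficients comparable and is important for regularized models such as Ridge and Lasso.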
Regularization: Techniques like L1 (LASSO) or L2 (Ridge) regularization can help address overfitting (a model that performs well on training data but poorly on unseen data) by introducing penalties for large coefficients. The Ridge and Lasso classes in scikit-learn provide options for regularized regression.
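
To make the effect of an L2 penalty concrete, here is a small sketch comparing LinearRegression with Ridge on the sample data used earlier. The value alpha=10.0 is an arbitrary choice for illustration, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Plain OLS: no penalty on the coefficient
ols = LinearRegression().fit(X, y)

# Ridge: adds alpha * (sum of squared coefficients) to the loss;
# the intercept itself is not penalized
ridge = Ridge(alpha=10.0).fit(X, y)

print(f"OLS slope:   {ols.coef_[0]:.4f}")
print(f"Ridge slope: {ridge.coef_[0]:.4f}")
```

The Ridge slope is pulled toward zero relative to the OLS slope; larger alpha values shrink it further.
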
Other Class Modules in scikit-learn:

LogisticRegression: For classification tasks involving binary or multiple classes, LogisticRegression implements logistic regression to model the probability of a sample belonging to a particular class.
Ridge and Lasso: These classes enable regularized linear regression with L2 (Ridge) and L1 (LASSO) penalties, respectively, useful for reducing overfitting and coefficient shrinkage for sparsity.
RidgeClassifier and LassoCV: RidgeClassifier applies Ridge (L2-penalized) regression to classification, while LassoCV is Lasso regression with the regularization strength chosen by built-in cross-validation.
LinearSVC: For Support Vector Classification (SVC) with a linear kernel, LinearSVC is efficient and suitable for high-dimensional data.
Perceptron: This class implements the Perceptron algorithm, a simple linear classifier for binary classification.
SGDClassifier: For Stochastic Gradient Descent (SGD) classification with various loss functions (e.g., hinge loss for a linear SVM), SGDClassifier is flexible for large datasets.
RandomizedLogisticRegression: This class performed feature selection by stability selection (refitting logistic regression on random subsamples of the data). It was deprecated in scikit-learn 0.19 and removed in 0.21, so it is not available in current releases.
HuberRegressor: For robust regression less sensitive to outliers, HuberRegressor employs a smooth combination of least squares and L1 losses.
ElasticNet: Combining L1 and L2 regularization, ElasticNet can handle sparse data with potentially correlated features.
By understanding these classes and their capabilities, you can effectively choose the appropriate linear or regularized linear model for your specific machine learning tasks in scikit-learn.



2. **Logistic regression**:
Logistic regression, or 'logit regression', is a supervised learning algorithm used for binary classification, such as deciding whether an image fits into one class.

Originating from statistics, logistic regression technically predicts the probability that an input can be categorised into a single primary class. In practice, however, this can be used to group outputs into one of two categories ('the primary class' or 'not the primary class'). This is achieved by setting a threshold on the predicted probability: for example, any output below 0.50 is put in one group, and any output between 0.50 and 1.00 is put in the other.

As a result, logistic regression in machine learning is typically used for binary categorisation rather than predictive modelling.

Logistic Regression for Binary Classification

Logistic regression is a powerful supervised learning algorithm used for binary classification. It estimates the probability of an event (represented by the dependent variable) occurring based on one or more independent variables (features). In simpler terms, it predicts the likelihood of something belonging to one of two categories.

Key Concepts:

Binary Classification: Logistic regression is ideal for problems where the target variable can have only two possible outcomes, typically labeled as 0 and 1 (e.g., email spam or not spam, disease present or absent).
Logistic Function (Sigmoid): This function transforms the linear combination of features into a probability value between 0 and 1. A value closer to 1 indicates a higher probability of belonging to class 1, and vice versa.
Model Fitting: The model learns the coefficients of the linear equation by minimizing the log loss (cross-entropy) between predicted probabilities and actual class labels, using an iterative optimizer such as gradient descent or L-BFGS.
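
The logistic (sigmoid) function described above can be sketched in a few lines of NumPy. The weights, bias and feature values below are hypothetical numbers picked for illustration; they are not learned from any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one sample (illustration only)
weights = np.array([0.8, -0.5])
bias = 0.1
features = np.array([2.0, 1.0])

# Linear combination of the features, then squashed into a probability
z = np.dot(weights, features) + bias
p_class_1 = sigmoid(z)

print(f"z = {z:.2f}, P(class 1) = {p_class_1:.4f}")
```

A linear score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural default threshold.
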
Code Example (Python with scikit-learn):

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sample data (replace with your actual data)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])  # Features
y = np.array([0, 1, 1, 0])  # Target labels (0 or 1)

# Create and train the model
model = LogisticRegression()
model.fit(X, y)

# Make predictions
new_data = np.array([[9, 10]])  # Example: Predict for new features
predicted_probability = model.predict_proba(new_data)  # Get probabilities for both classes
print(predicted_probability)  # Shape (1, 2); columns follow model.classes_: [P(class 0), P(class 1)]

# Get class predictions (thresholding the probability)
predicted_class = model.predict(new_data)
print(predicted_class)  # Predicted label (0 or 1) for the new sample


Explanation:

Import libraries: numpy for numerical computations and LogisticRegression from scikit-learn.
Sample data: Replace with your actual data. X is a 2D array for features, and y is a 1D array for target labels (0 or 1).
Model creation: Instantiate a LogisticRegression object.
Model training: Use model.fit(X, y) to train the model on the data.
Prediction (Probabilities): Employ model.predict_proba(new_data) to get probability values for both classes for the new features in new_data. The columns follow the order of model.classes_, so the first column is class 0 and the second is class 1.
Prediction (Class Labels): Utilize model.predict(new_data) to obtain the predicted class label (0 or 1) for the new data. For binary problems this is equivalent to applying a threshold of 0.5 to the predicted probability of class 1 (e.g., if probability >= 0.5, predict class 1).
Remember: This is a simplified explanation for binary classification. Logistic regression can be extended to multi-class problems, but the interpretation becomes more complex.
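
The relationship between predict_proba and predict can be checked directly. This is a minimal sketch on a small, cleanly separable toy dataset invented for illustration; the names labels_from_threshold and labels_from_predict are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (illustration only): small values are class 0, large values are class 1
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

new_data = np.array([[0.5], [4.5]])
proba = model.predict_proba(new_data)  # columns ordered as model.classes_

# Threshold the class-1 probability at 0.5 and compare with predict()
labels_from_threshold = (proba[:, 1] >= 0.5).astype(int)
labels_from_predict = model.predict(new_data)

print("classes_:", model.classes_)
print("probabilities:\n", proba)
print("thresholded:", labels_from_threshold, "predict():", labels_from_predict)
```

If a different trade-off between false positives and false negatives is needed, the same thresholding step can be applied with a value other than 0.5.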



**Here's a more complex example of logistic regression with detailed comments, covering text feature extraction, a train-test split, and a regularization hyperparameter:**

Scenario: Predicting whether an email is spam (class 1) or not spam (class 0) based on word frequencies in the email. We'll use a dataset of emails with a text column and corresponding spam/not spam labels, and convert the text into TF-IDF word-frequency features.

Code (Python with scikit-learn):



In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Load data (replace with your actual data loading method)
data = pd.read_csv("email_data.csv")
X = data["text"]  # Feature: Email text
y = data["label"]  # Target variable: Spam (1) or not spam (0)

# Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction: TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000)  # Keep the 1000 most frequent terms in the training corpus
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

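# Logistic regression model (binary: spam vs. not spam)
# multi_class is omitted: deprecated since scikit-learn 1.5, not needed for two classes
model = LogisticRegression(solver="lbfgs", random_state=42)  # random_state is only used by the "sag", "saga" and "liblinear" solvers

# Hyperparameter tuning (optional)
# Use techniques like GridSearchCV for more extensive tuning
model.set_params(C=1.0)  # Inverse regularization strength (1.0 is the default; smaller values regularize more)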

# Model training
model.fit(X_train_tfidf, y_train)

# Prediction on test data
y_pred = model.predict(X_test_tfidf)

# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print(f"Accuracy: {accuracy:.4f}")
print(f"F1-score (weighted): {f1:.4f}")


Explanation:

Import libraries:

pandas for data manipulation (assuming CSV data).
train_test_split from sklearn.model_selection for splitting data into training and testing sets.
TfidfVectorizer from sklearn.feature_extraction.text for feature extraction using TF-IDF (Term Frequency-Inverse Document Frequency).
LogisticRegression from sklearn.linear_model for the logistic regression model.
accuracy_score and f1_score from sklearn.metrics for evaluation metrics.
Data loading: Replace with your data loading method (e.g., reading CSV).

X stores the email text data (features).
y stores the class labels (spam or not spam).
Train-test split: Splits data into training (80%) and testing (20%) sets using train_test_split.

random_state=42 sets a seed for reproducibility.
Feature extraction:

TfidfVectorizer creates features by weighting each word's frequency in an email by how rare that word is across all emails.
max_features=1000 keeps only the 1000 terms with the highest frequency across the training corpus (you can experiment with different values).
X_train_tfidf and X_test_tfidf are the transformed training and testing features (numerical representations of the text).
Logistic regression model:

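LogisticRegression accepts the following parameters relevant to this example:
multi_class="multinomial": Requests the multinomial (softmax) formulation for more than two classes. It is not needed for a two-class problem like spam detection, and the parameter is deprecated since scikit-learn 1.5, so it can be omitted.
solver="lbfgs": Optimization algorithm (the default; others like "liblinear" might be suitable depending on dataset size).
random_state=42: Sets a seed for reproducibility. It only has an effect with the "sag", "saga" and "liblinear" solvers.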
Hyperparameter tuning (optional):

Hyperparameter tuning involves optimizing model parameters for better performance.
Here, we're setting the regularization parameter C (example; C is the inverse of the regularization strength), but consider using techniques like GridSearchCV for more extensive tuning.
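
A search over C with GridSearchCV could look like the sketch below. Because the email dataset is not available here, it uses synthetic data from make_classification as a stand-in; the grid values, cv=5 and scoring="f1" are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (stand-in for real features and labels)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Candidate values for C (inverse regularization strength)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated search, scored with F1
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="f1",
)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

After fitting, search.best_estimator_ holds a model refit on all of the supplied data with the best parameters.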

Let's take a closer look at the dependencies.

In [None]:
from sklearn.model_selection import train_test_split

In machine learning, train_test_split is a crucial function from the sklearn.model_selection module. It's used to split your dataset into two essential parts: training data and testing data. Here's a breakdown of its purpose and usage:

What it Does:

Divides your dataset into two subsets:
Training set (usually 70-80%): This larger portion of the data is used to train your machine learning model. The model learns patterns and relationships within the training data to build its knowledge base.
Testing set (usually 20-30%): This smaller portion remains unseen by the model during training. It's used to evaluate the model's generalizability and performance on unseen data.
Why It's Important:

Detects Overfitting: Without splitting the data, you cannot tell whether your model has simply memorized the training examples, which leads to poor performance on new data (overfitting). Testing data helps assess how well your model generalizes to unseen examples.
Provides a Reliable Performance Estimate: By evaluating the model on unseen data (testing set), you obtain a more accurate measure of its effectiveness in real-world scenarios.
How to Use It:

The train_test_split function takes several arguments:

X: The feature matrix or array containing your data samples (usually rows represent samples and columns represent features).
y: The target vector or array containing the corresponding labels for each sample in X.
test_size (default: 0.25): The proportion of data to allocate to the testing set (0.0 to 1.0).
random_state (default: None): Controls the randomness for shuffling data (used for reproducibility).
shuffle (default: True): Whether to shuffle the data before splitting.
Here's an example of how to use train_test_split with sample data:

In [14]:
from sklearn.model_selection import train_test_split

# Sample data (replace with your actual data)
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
y = [1, 0, 1, 0, 1, 0]

# Split data into training and testing sets (test_size=0.20; with 6 samples this rounds up to 2 test samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)

print(f"X_train: {X_train}\nX_test: {X_test}\ny_train: {y_train}\ny_test: {y_test}")


X_train: [[1, 2], [5, 6], [7, 8], [3, 4]]
X_test: [[11, 12], [9, 10]]
y_train: [1, 1, 0, 0]
y_test: [0, 1]


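In this example, the code requests a 20% test split. Because the dataset has only 6 samples, scikit-learn rounds the test set up to 2 samples, leaving 4 for training, as the output shows. Since no random_state is set, the rows that land in each set change from run to run. You can adjust the test_size parameter to control the split ratio based on your needs.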

Remember, splitting your data using train_test_split is a fundamental step in any machine learning workflow. It ensures your model is trained on relevant data and generalizes well to unseen examples.
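
To illustrate the random_state argument described above, this small sketch splits the same sample data twice with the same seed and checks that both splits are identical. The seed value 42 and the names split_a and split_b are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]
y = [1, 0, 1, 0, 1, 0]

# Two splits with the same seed: the shuffling is reproducible
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

X_train, X_test, y_train, y_test = split_a

print(f"Train size: {len(X_train)}, test size: {len(X_test)}")
print("Same split both times:", split_a == split_b)
```

For classification data with uneven class frequencies, the stratify argument of train_test_split can additionally keep the class proportions similar in both sets.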

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

TfidfVectorizer: Transforming Text Data into Numerical Features

In machine learning, particularly for text classification tasks, we often deal with textual data. However, machine learning models can't directly process text. TfidfVectorizer from sklearn.feature_extraction.text helps bridge this gap by transforming text data into numerical features suitable for model training. It employs a technique called TF-IDF (Term Frequency-Inverse Document Frequency).

How TF-IDF Works:

Term Frequency (TF): This measures how frequently a term (word) appears in a specific document.
Inverse Document Frequency (IDF): This considers the overall importance of a term across all documents in the corpus. Words that appear frequently across all documents (e.g., "the", "a") will have lower IDF scores, while words that are specific to a few documents will have higher IDF scores.
By combining TF and IDF, TfidfVectorizer identifies the most informative words within a document and downplays the importance of common words. This helps create a meaningful numerical representation of the text data.

Code Example:

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text documents
documents = ["Georgia has been inhabited since prehistoric times, hosting the world's earliest known sites of winemaking, gold mining, and textiles.[15][16] The classical era saw the emergence of several kingdoms, such as Colchis and Iberia, that formed the nucleus of the modern Georgian state. In the early fourth century, Georgians officially adopted Christianity, which contributed to their gradual unification and ethnogenesis. In the High Middle Ages, the Kingdom of Georgia reached its Golden Age during the reign of King David IV and Queen Tamar. The kingdom subsequently declined and disintegrated under the hegemony of various regional",
               "Georgia has been inhabited since prehistoric times,fter the Russian Revolution in 1917, Georgia briefly emerged as an independent republic under German protectorate,[17] but was invaded and annexed by the Soviet Union in 1922, becoming one of its constituent republics. In the 1980s, an independence movement grew quickly, leading to Georgia's secession from the Soviet Union in April 1991."]

# Create a TF-IDF vectorizer (keep at most the 50 most frequent terms)
vectorizer = TfidfVectorizer(max_features=50)

# Transform documents into numerical features (TF-IDF vectors)
features = vectorizer.fit_transform(documents)

# Print feature names and their corresponding values in the TF-IDF vectors
feature_names = vectorizer.get_feature_names_out()
print(f"Feature names: {feature_names}\nFeatures:\n{features.toarray()}")


Feature names: ['an' 'and' 'as' 'been' 'georgia' 'has' 'in' 'inhabited' 'invaded' 'its'
 'iv' 'king' 'kingdom' 'kingdoms' 'known' 'leading' 'middle' 'mining'
 'modern' 'movement' 'nucleus' 'of' 'officially' 'one' 'prehistoric'
 'protectorate' 'queen' 'quickly' 'russian' 'saw' 'secession' 'several'
 'since' 'sites' 'soviet' 'state' 'subsequently' 'such' 'tamar' 'textiles'
 'the' 'their' 'times' 'to' 'under' 'unification' 'union' 'various' 'was'
 'which']
Features:
[[0.         0.31537198 0.0630744  0.0630744  0.12614879 0.0630744
  0.12614879 0.0630744  0.         0.0630744  0.08864886 0.08864886
  0.17729772 0.08864886 0.08864886 0.         0.08864886 0.08864886
  0.08864886 0.         0.08864886 0.37844637 0.08864886 0.
  0.0630744  0.         0.08864886 0.         0.         0.08864886
  0.         0.08864886 0.0630744  0.08864886 0.         0.08864886
  0.08864886 0.08864886 0.08864886 0.08864886 0.69381835 0.08864886
  0.0630744  0.0630744  0.0630744  0.08864886 0.         0.088648

Explanation:

Import: We import TfidfVectorizer from sklearn.feature_extraction.text.
Sample Data: We create a list of sample text documents.
Create Vectorizer: We instantiate a TfidfVectorizer object, specifying max_features=50 to keep at most the 50 most frequent terms in the corpus (you can adjust this parameter).
Transform Documents: The fit_transform method both fits the vectorizer to the data (learns the vocabulary) and transforms the documents into TF-IDF vectors (numerical features).
Print Results: We print the extracted feature names and their corresponding values in the TF-IDF vectors.
This code demonstrates how TfidfVectorizer helps convert textual data into a numerical representation that machine learning models can understand and process effectively.

In [None]:
from sklearn.metrics import accuracy_score, f1_score

Explanation:

Importing from sklearn.metrics:

This line imports two commonly used evaluation metrics from the sklearn.metrics module:

accuracy_score: This function calculates the accuracy, which is the proportion of correctly classified samples. It's a basic metric but can be misleading in certain cases (e.g., imbalanced class distributions).
f1_score: This function calculates the F1-score, a harmonic mean of precision and recall. It's often a more robust metric than accuracy, especially for imbalanced datasets.
How They Work:

Accuracy:

It's calculated as the number of correctly predicted samples divided by the total number of samples:

In [None]:
# accuracy = (True Positives + True Negatives) / (Total Samples)

F1-score:

It's the harmonic mean of precision and recall, where:

Precision: Proportion of true positives among predicted positives.
Recall: Proportion of true positives among actual positives.
F1-score considers both precision and recall, making it a more balanced measure of a model's performance. It's calculated as:

In [9]:
# F1-score = 2 * (Precision * Recall) / (Precision + Recall)

In [10]:
from sklearn.metrics import accuracy_score, f1_score

# Sample predictions (replace with your actual predictions)
y_true = [1, 0, 1, 0, 1]  # True labels
y_pred = [0, 0, 1, 1, 0]  # Predicted labels

# Calculate accuracy and F1-score
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")  # Per-class F1, averaged with weights proportional to class support

print(f"Accuracy: {accuracy:.4f}")
print(f"F1-score (weighted): {f1:.4f}")


Accuracy: 0.4000
F1-score (weighted): 0.4000


Explanation of the Code:

We import accuracy_score and f1_score from sklearn.metrics.
We define sample lists for true labels (y_true) and predicted labels (y_pred).
We calculate accuracy using accuracy_score.
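We calculate F1-score using f1_score with the average parameter set to "weighted", which averages the per-class F1 scores weighted by the number of true samples in each class (this also works for multi-class classification).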
We print the accuracy and F1-score values.
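
The precision, recall and F1 formulas can be checked by counting true positives, false positives and false negatives directly. This sketch reuses y_true and y_pred from the example above and treats class 1 as the positive class, which is also the default for scikit-learn's binary metrics.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 0, 1]  # True labels
y_pred = [0, 0, 1, 1, 0]  # Predicted labels

# Count outcomes for the positive class (label 1)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Precision: {precision:.4f} (sklearn: {precision_score(y_true, y_pred):.4f})")
print(f"Recall:    {recall:.4f} (sklearn: {recall_score(y_true, y_pred):.4f})")
print(f"F1:        {f1_manual:.4f} (sklearn: {f1_score(y_true, y_pred):.4f})")
```
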
In Conclusion:

accuracy_score and f1_score provide essential tools for evaluating the performance of your machine learning models. While accuracy is a basic starting point, F1-score often offers a more comprehensive assessment, especially for imbalanced datasets. Remember to choose the metric(s) that best align with your specific problem and its evaluation criteria.
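
A small made-up example shows why accuracy can mislead on imbalanced data: a "model" that always predicts the majority class scores high accuracy but an F1-score of zero for the minority class. The 95/5 class split is an assumption chosen for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced labels (illustration only): 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 instead of warning when no positives are predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.4f}")
print(f"F1-score (positive class): {f1:.4f}")
```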

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data (replace with your actual data)
emails = ["Buy this new product!", "Important meeting tomorrow", "Free money! Click here!", "Company update"]
labels = [1, 0, 1, 0]  # 1: Spam, 0: Not Spam

# Feature extraction: count word occurrences in each email
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

# Train-test split (test_size=0.2; with 4 samples this leaves 3 for training and 1 for testing)
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Logistic regression model for classification
# (multi_class is omitted: it is deprecated since scikit-learn 1.5 and not needed for two classes)
model = LogisticRegression(solver="lbfgs", random_state=42)
model.fit(X_train, y_train)

# Prediction on test data
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")

# Example prediction on a new email
new_email = "hello my name is yke!"
new_features = vectorizer.transform([new_email])  # Transform new email
prediction = model.predict(new_features)[0]

if prediction == 1:
    print("Prediction: This email is likely spam.")
else:
    print("Prediction: This email is likely not spam.")


Accuracy: 0.0000
Prediction: This email is likely spam.
