Supervised learning algorithms fall into two groups: classification and regression. We start with two regression algorithms: Linear Regression and Polynomial Regression.

Linear Regression Summary

Definition: A supervised learning algorithm that models the relationship between an independent variable (X) and a dependent variable (Y) using a straight line.

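Equation:

Y = \beta_0 + \beta_1 X + \varepsilon

Here \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon is the error term.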

Types: Simple Linear Regression (one independent variable), Multiple Linear Regression (multiple independent variables).

Uses: Predicting house prices, estimating sales based on ads, analyzing salary vs. experience.

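Evaluation Metrics: MSE (Mean Squared Error, the average squared difference between predicted and actual values), R² Score (the proportion of the variance in Y that the model explains).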
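
To make the two metrics concrete, here is a minimal sketch that computes them by hand. The values in `y_true` and `y_pred` are made up for illustration and are not taken from the notebook's dataset; the helper names are hypothetical. It uses the standard definitions: MSE is the mean of the squared residuals, and R² is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Hand-computed MSE and R² on toy values (illustrative numbers only).

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical actual values
y_pred = [2.5, 5.5, 7.0, 9.0]   # hypothetical predictions

mse_manual = mean_squared_error_manual(y_true, y_pred)
r2_manual = r2_score_manual(y_true, y_pred)
print(mse_manual, r2_manual)
```

A lower MSE means smaller errors, while an R² closer to 1 means the predictions track the variation in the target more closely.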

Challenges: Autocorrelation (pattern in errors), multicollinearity (high correlation between variables), bias-variance tradeoff (balancing model complexity).

In [1]:
# Linear regression: predict sepal_width from the remaining Iris columns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('/content/IRIS.csv')
df = pd.get_dummies(df)  # one-hot encode the categorical species column

# Features: every column except the target; target: sepal_width
x = df.drop(columns=['sepal_width'])
y = df['sepal_width'].values.reshape(-1, 1)

# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f' mean_squared_error={mse},r2_score= {r2}')

 mean_squared_error=0.052937128216538086,r2_score= 0.7268309420624719


Polynomial Regression Summary

Definition: An extension of linear regression that models the relationship between variables using a polynomial equation instead of a straight line.

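Equation:

Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_n X^n + \varepsilon

Here n is the degree of the polynomial and \varepsilon is the error term.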

Purpose: Captures non-linear relationships that cannot be represented by simple linear regression.

Uses: Predicting complex trends, modeling growth patterns, analyzing non-linear data relationships.

Evaluation Metrics: MSE (Mean Squared Error), R² Score (Measures how well the model fits the data).

Challenges: Overfitting with high-degree polynomials, sensitivity to noise, computational complexity.
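
The feature expansion behind polynomial regression, and the reason high degrees become costly, can be sketched in plain Python. The function `polynomial_features` below is a hypothetical helper written for illustration; it produces the bias term followed by all monomials up to the given degree, which for degree 2 and two inputs a, b gives [1, a, b, a², ab, b²]. The sample values are made up.

```python
from itertools import combinations_with_replacement

def polynomial_features(row, degree=2):
    """Return [1, degree-1 terms, degree-2 terms, ...] for one sample."""
    features = [1.0]  # bias term
    for d in range(1, degree + 1):
        # Each combination of column indices (with repetition) is one monomial
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for idx in combo:
                term *= row[idx]
            features.append(term)
    return features

sample = [2.0, 3.0]  # two hypothetical input features a and b
expanded = polynomial_features(sample, degree=2)
print(expanded)      # order: 1, a, b, a^2, a*b, b^2

# The number of generated columns grows quickly with the degree
four_inputs = [1.0, 1.0, 1.0, 1.0]
for degree in (2, 3, 5):
    print(degree, len(polynomial_features(four_inputs, degree)))
```

With four input features, the expansion has 15 columns at degree 2 and 126 at degree 5, which is one reason high-degree models overfit small datasets and cost more to train.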

In [2]:
# Polynomial regression: predict petal_width from the remaining Iris columns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('/content/IRIS.csv')
df = pd.get_dummies(df, dtype=int)  # one-hot encode species as 0/1 columns
df = df.dropna()

x = df.drop(columns=['petal_width'])
y = df['petal_width'].values.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Expand the features with all degree-2 terms (squares and pairwise products)
poly = PolynomialFeatures(degree=2)
x_train_poly = poly.fit_transform(x_train)  # fit the expansion on the training data only
x_test_poly = poly.transform(x_test)        # apply the same expansion to the test data

# A linear model on the expanded features is a polynomial model in the original ones
model = LinearRegression()
model.fit(x_train_poly, y_train)
y_pred = model.predict(x_test_poly)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(mse, r2)

0.032200209797055214 0.9369995895275007


Now we turn to classification algorithms.

Logistic Regression Summary

Definition: A supervised learning algorithm used for binary classification, predicting probabilities of categorical outcomes.

Equation:

P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}

Purpose: Estimates the probability of an event occurring and applies a threshold (e.g., 0.5) for classification.
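
The probability-then-threshold step can be sketched as follows. The coefficients `beta0` and `beta1` are hypothetical values chosen for illustration, not fitted parameters; in practice they are learned from the training data. The sigmoid function maps the linear score to a probability between 0 and 1, and the threshold turns that probability into a class label.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x, beta0, beta1, threshold=0.5):
    """Return (probability of class 1, predicted label) for one input x."""
    p = sigmoid(beta0 + beta1 * x)
    return p, int(p >= threshold)

beta0, beta1 = -4.0, 2.0  # hypothetical coefficients

for x in (1.0, 2.0, 3.0):
    p, label = predict_label(x, beta0, beta1)
    print(f'x={x}: probability={p:.3f}, label={label}')
```

Raising the threshold above 0.5 makes the model more conservative about predicting the positive class, which trades recall for precision.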

Uses: Spam detection, disease diagnosis, customer churn prediction.

Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
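
As a small worked example of how these metrics differ, the sketch below derives accuracy, precision, recall, and F1 from the confusion counts. The labels are made up and deliberately imbalanced (3 positives out of 10), and `classification_metrics` is a hypothetical helper, not a library function.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical ground truth
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # hypothetical predictions

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.7 even though the model finds only one of the three positives (recall of about 0.33), which is why accuracy alone can mislead on imbalanced data.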

Challenges: Assumes a linear relationship between features and log-odds, sensitive to outliers, not suitable for complex non-linear patterns.

In [3]:
# Logistic regression: classify cars as "unacc" (1) vs. any other class (0)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('/content/cars.csv')
df = df.dropna()
df = pd.get_dummies(df, dtype=int)  # one-hot encode every categorical column

# Drop all one-hot class columns from the features so the target does not leak
x = df.drop(columns=['class_unacc', 'class_acc', 'class_good', 'class_vgood'])
y = df['class_unacc'].to_numpy()  # binary target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

acc = accuracy_score(y_test, y_pred)
print(acc)

0.9441233140655106


Decision Tree Summary

Definition: A supervised learning algorithm used for classification and regression by splitting data into branches based on feature conditions.

Structure: Consists of nodes (decision points), branches (choices), and leaves (final outcomes).

Purpose: Creates a flowchart-like model to make decisions based on feature values.

Uses: Medical diagnosis, credit risk assessment, customer segmentation.

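Evaluation Metrics: Accuracy (for classification), Mean Squared Error (for regression). Gini Impurity and Entropy are splitting criteria used while building the tree, not measures of predictive performance.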
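
Gini impurity and entropy both measure how mixed the class labels in a node are, and a decision tree uses one of them to choose its splits. The sketch below computes both with the standard formulas (Gini = 1 - Σ p², entropy = Σ p log₂(1/p)); the label lists are made up for illustration.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Sum of p * log2(1/p) over the classes; 0 means a pure node."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

pure_node = ['a', 'a', 'a', 'a']    # hypothetical node with a single class
mixed_node = ['a', 'a', 'b', 'b']   # hypothetical node split evenly between two classes

print(gini_impurity(pure_node), entropy(pure_node))
print(gini_impurity(mixed_node), entropy(mixed_node))
```

A split is considered good when the child nodes it produces have lower impurity than the parent node.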

Challenges: Prone to overfitting, sensitive to noisy data, biased towards dominant classes.

In [4]:
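# Decision tree: classify Iris species from the four measurements
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('/content/IRIS.csv')
df = df.dropna()

x = df.drop(columns=['species'])
# One-hot targets make this a multi-label problem, so accuracy_score below counts
# a row as correct only if all indicator columns match. Passing df['species']
# directly as the target is the more common choice.
y = pd.get_dummies(df['species'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# max_depth=3 limits how deep the tree can grow, which curbs overfitting
model = tree.DecisionTreeClassifier(max_depth=3)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)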

0.9666666666666667


Random Forest Summary

Definition: An ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting.

How It Works: Trains multiple decision trees, each on a bootstrap sample of the data (rows drawn with replacement) and a random subset of the features, then averages their predictions (for regression) or takes a majority vote (for classification).
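
The two ingredients described above, random resampling of the data and aggregation of the trees' outputs, can be sketched without any library. This is not a full random forest: the individual tree predictions are hard-coded hypothetical values, and the helper names are invented for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)   # the data one tree would be trained on
print(sample)

tree_votes = ['spam', 'ham', 'spam', 'spam', 'ham']  # hypothetical outputs of 5 trees
print(majority_vote(tree_votes))

tree_estimates = [2.0, 3.0, 4.0]                     # hypothetical outputs of 3 trees
print(average(tree_estimates))
```

Because each tree sees a different sample, their individual errors tend to differ, and combining them reduces the variance of the final prediction.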

Purpose: Increases model stability and generalization by reducing variance.

Uses: Fraud detection, medical diagnosis, recommendation systems.

Evaluation Metrics: Accuracy, Precision, Recall, F1-Score, Mean Squared Error (for regression).

Challenges: Computationally expensive, less interpretable than a single decision tree, may struggle with high-dimensional sparse data.

In [5]:
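# Random forest: classify emails using the 'Prediction' column as the target
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('/content/emails.csv')
df = df.fillna(df.mode().iloc[0])  # fill missing values with each column's most frequent value

# Caution: get_dummies one-hot encodes every string column. If 'Email No.' holds
# string IDs, it becomes many ID indicator columns that the drop below does not
# remove; dropping the identifier before encoding avoids this.
df = pd.get_dummies(df)
x = df.drop(columns=['Email No.', 'Prediction'], errors='ignore')
y = df['Prediction']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# 100 trees, each limited to depth 10
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)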

0.9130434782608695


Naïve Bayes Summary

Definition: A probabilistic classification algorithm based on Bayes' Theorem, assuming feature independence.

Equation:


P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}
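
A small worked example of the theorem, using made-up probabilities for a single word in a spam filter: the prior P(spam), the likelihood of the word in spam, and the likelihood of the word in non-spam are all hypothetical numbers. The denominator P(X) is obtained by summing over both classes.

```python
def posterior(prior, likelihood, likelihood_other):
    """P(Y|X) for a two-class problem via Bayes' Theorem.

    prior            : P(Y)
    likelihood       : P(X|Y)
    likelihood_other : P(X|not Y)
    """
    evidence = likelihood * prior + likelihood_other * (1 - prior)  # P(X)
    return likelihood * prior / evidence

p_spam = 0.2              # hypothetical: 20% of emails are spam
p_word_given_spam = 0.6   # hypothetical: the word appears in 60% of spam
p_word_given_ham = 0.05   # hypothetical: the word appears in 5% of non-spam

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(round(p_spam_given_word, 4))
```

With these numbers the posterior is 0.12 / 0.16 = 0.75, so seeing the word raises the spam probability from 20% to 75%. Naïve Bayes applies the same update across many features by multiplying their individual likelihoods.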

Gaussian Naïve Bayes: For continuous data (assumes normal distribution).

Multinomial Naïve Bayes: For text classification (word frequencies).

Bernoulli Naïve Bayes: For binary feature data.

Uses: Spam detection, sentiment analysis, medical diagnosis.

Evaluation Metrics: Accuracy, Precision, Recall, F1-Score.

Challenges: Assumes independence between features (rarely true in real data), sensitive to imbalanced data.

In [6]:
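# Gaussian Naive Bayes: classify emails using the 'Prediction' column as the target
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('/content/emails.csv')

# Caution: if 'Email No.' holds string IDs, get_dummies turns it into ID indicator
# columns that the drop below does not remove; drop it before encoding to avoid this.
df = pd.get_dummies(df)
x = df.drop(columns=['Email No.', 'Prediction'], errors='ignore')
y = df['Prediction']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# GaussianNB models each feature with a per-class normal distribution
model = GaussianNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)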

0.9410628019323671
