
#Classification Models
Build and evaluate Logistic Regression, Naïve Bayes, KNN, Decision Tree, and Support Vector Classifier models on a suitable dataset. Preprocess the data and report Accuracy, Precision, Recall, and F1-Score for each model on the test data.

#Importing the required libraries and Loading the Data
The required libraries are imported to handle data manipulation, preprocessing, and model training. pandas and numpy are used for data handling and numerical operations, while sklearn provides tools for splitting the data, encoding categorical features, scaling, and building classification models. The 50 Startups dataset (50_Startups.csv, read here from the zip archive passed to pd.read_csv) is then loaded into a pandas DataFrame, which is used for training and testing the machine learning models.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
data = pd.read_csv('/content/50 Start Up.zip')
data.head()

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


#Exploring the data
Missing values are counted per column with isnull().sum(). The data.info() method then shows the data type and the number of non-null entries for each column, and data.describe() produces summary statistics for the numerical columns (mean, standard deviation, minimum, quartiles, and maximum), which helps in understanding the distribution of the data.

In [2]:
missing = data.isnull().sum()
print(missing)
print(data.info())
print(data.describe())

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
None
           R&D Spend  Administration  Marketing Spend         Profit
count      50.000000       50.000000        50.000000      50.000000
mean    73721.615600   121344.639600    211025.097800  112012.639200
std     45902.256482    28017.802755    122290.310726   40306.180338
min         0.000000    51283.140000         0.000000   14681.400000
25%     39936.370000   103730.875000    129300.132500   

#Encoding Categorical Variable 'State'
The categorical variable "State" is handled using one-hot encoding with the OneHotEncoder from sklearn. The drop='first' argument drops the first category (California, since categories are sorted alphabetically) to avoid multicollinearity; a row with zeros in both remaining columns therefore represents California. The encoded columns are concatenated with the dataset and the original "State" column is removed. The result is stored in a new DataFrame, data_processed, and head() shows the first few rows, which now contain a numerical representation of "State".

In [3]:
categorical_features = ['State']
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_state = encoder.fit_transform(data[categorical_features])

encoded_state_df = pd.DataFrame(encoded_state, columns=encoder.get_feature_names_out(categorical_features))
data_processed = pd.concat([data.drop(columns=categorical_features), encoded_state_df], axis=1)
data_processed.head()

,R&D Spend,Administration,Marketing Spend,Profit,State_Florida,State_New York
0,165349.2,136897.8,471784.1,192261.83,0.0,1.0
1,162597.7,151377.59,443898.53,191792.06,0.0,0.0
2,153441.51,101145.55,407934.54,191050.39,1.0,0.0
3,144372.41,118671.85,383199.62,182901.99,0.0,1.0
4,142107.34,91391.77,366168.42,166187.94,1.0,0.0


#Defining the Features (X)
The features (X) are defined by dropping the target variable "Profit" from the data_processed dataset. This results in a DataFrame X that contains all the independent variables (or predictors) for the model. The head() function is used to display the first few rows of X, providing a preview of the feature set that will be used for training the classification models.

In [4]:
X = data_processed.drop(columns=['Profit'])
X.head()

,R&D Spend,Administration,Marketing Spend,State_Florida,State_New York
0,165349.2,136897.8,471784.1,0.0,1.0
1,162597.7,151377.59,443898.53,0.0,0.0
2,153441.51,101145.55,407934.54,1.0,0.0
3,144372.41,118671.85,383199.62,0.0,1.0
4,142107.34,91391.77,366168.42,1.0,0.0


#Converting Profit into Categorical Labels
Transform the Profit column into categorical labels by binning the values into three predefined ranges: Low (up to 50,000), Medium (above 50,000 and up to 100,000), and High (above 100,000). This converts the continuous target variable into discrete classes, which is what the classification models require.

In [5]:
bins = [0, 50000, 100000, np.inf]
labels = ['Low', 'Medium', 'High']
y = pd.cut(data_processed['Profit'], bins=bins, labels=labels)
y.head()

,Profit
0,High
1,High
2,High
3,High
4,High
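
The sketch below shows how pd.cut treats values that sit exactly on a bin edge, using the same bins and labels as above on a handful of made-up profit values (toy_profit and zero_profit are hypothetical names). By default the intervals are closed on the right, so 50,000 is Low and 100,000 is Medium, and a value of exactly 0 falls outside the first bin.

```python
import numpy as np
import pandas as pd

# Made-up profit values chosen to hit the bin edges (not the workshop data)
toy_profit = pd.Series([14681.40, 50000.00, 50000.01, 100000.00, 192261.83])

toy_bins = [0, 50000, 100000, np.inf]
toy_labels = ["Low", "Medium", "High"]

# Default right=True gives intervals (0, 50000], (50000, 100000], (100000, inf]
toy_y = pd.cut(toy_profit, bins=toy_bins, labels=toy_labels)
print(toy_y.tolist())

# Class counts are worth checking before training, since bins can be very uneven
print(toy_y.value_counts())

# A profit of exactly 0 is not inside (0, 50000] and becomes NaN
zero_profit = pd.cut(pd.Series([0.0]), bins=toy_bins, labels=toy_labels)
print(zero_profit.isna().iloc[0])
```

If zero or negative profits were possible, passing include_lowest=True or a lower first edge would be one way to keep them in the Low class.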


#Splitting the Data
The data is split into training and test sets using the train_test_split function from sklearn. The independent variables (X) and the target variable (y) are divided into training and testing subsets. The test_size=0.2 argument reserves 20% of the data (10 of the 50 rows) for testing, while the remaining 80% is used for training. Setting random_state=42 makes the split reproducible. The head() function is used to display the first few rows of the training features (X_train), providing a preview of the data used to train the models.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

,R&D Spend,Administration,Marketing Spend,State_Florida,State_New York
12,93863.75,127320.38,249839.44,1.0,0.0
4,142107.34,91391.77,366168.42,1.0,0.0
37,44069.95,51283.14,197029.42,0.0,0.0
8,120542.52,148718.95,311613.29,0.0,1.0
3,144372.41,118671.85,383199.62,0.0,1.0
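
With a small dataset and uneven classes, a random split can leave a class with very few test rows. As an optional variation (not what the cell above does), the stratify argument of train_test_split keeps the class proportions roughly equal in both subsets. The sketch below uses made-up labels and features; all names in it are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 25 High, 20 Medium, 5 Low (not the workshop data)
toy_labels = np.array(["High"] * 25 + ["Medium"] * 20 + ["Low"] * 5)
toy_features = np.arange(50).reshape(-1, 1)

# stratify=toy_labels asks for the same class proportions in train and test
Xtr, Xte, ytr, yte = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# Count the classes in the 10-row test set
test_counts = {str(k): int(v) for k, v in zip(*np.unique(yte, return_counts=True))}
print(test_counts)
```

Stratification needs at least two samples in every class, so it may not be usable if a bin contains a single row.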


#Scaling the Data
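Use StandardScaler to standardize each feature to zero mean and unit variance, so that features measured on very different scales contribute comparably. This matters for models that rely on distances or gradient-based optimization, such as KNN, SVC, and Logistic Regression. The scaler is fitted on the training data only and then applied to both the training and test sets, so that no information from the test set influences the scaling.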

In [7]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)

X_test_scaled = scaler.transform(X_test)
X_test_scaled

array([[ 0.30245367,  0.52942836,  0.14916233, -0.73379939, -0.69388867],
       [-0.82734624, -1.40769369, -0.53560477, -0.73379939, -0.69388867],
       [-0.33181874, -0.20294703, -1.27505783,  1.36277029, -0.69388867],
       [-1.62147425,  0.11103854, -2.06176266, -0.73379939,  1.44115338],
       [ 0.35879726,  0.88291223,  0.41286919, -0.73379939,  1.44115338],
       [-1.63116196, -2.56004955, -2.07854935, -0.73379939,  1.44115338],
       [-0.04987791,  0.84817808, -0.89664846,  1.36277029, -0.69388867],
       [-0.2753597 ,  0.67912498, -0.86215204, -0.73379939, -0.69388867],
       [-0.30191325,  0.29793642, -1.67222209, -0.73379939, -0.69388867],
       [ 0.18462534,  1.19412269, -2.07854935, -0.73379939,  1.44115338]])
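
The following sketch illustrates the fit-on-train, transform-on-test pattern with a tiny made-up array (toy_train, toy_test, and toy_scaler are hypothetical names). The test row is scaled with the training mean and standard deviation, so it does not have to end up with zero mean or unit variance itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

print(toy_scaler.mean_)                # per-feature training means
print(toy_train_scaled.mean(axis=0))   # approximately 0 for each feature
print(toy_train_scaled.std(axis=0))    # approximately 1 for each feature
print(toy_test_scaled)                 # second value lies well above the train range
```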

#Training and Evaluating Logistic Regression Model
Train a Logistic Regression model with class_weight='balanced', which weights each class inversely to its frequency in the training data so that rare classes are not ignored. After training, predict on the test data and evaluate the model using Precision, Recall, F1-Score, and Accuracy.

In [8]:
logistic_model_balanced = LogisticRegression(class_weight='balanced')
logistic_model_balanced.fit(X_train_scaled, y_train)

y_pred_logistic = logistic_model_balanced.predict(X_test_scaled)

print("Logistic Regression with Class Weighting:")
print(classification_report(y_test, y_pred_logistic,zero_division=0))

Logistic Regression with Class Weighting:
              precision    recall  f1-score   support

        High       1.00      0.60      0.75         5
         Low       0.00      0.00      0.00         1
      Medium       0.57      1.00      0.73         4

    accuracy                           0.70        10
   macro avg       0.52      0.53      0.49        10
weighted avg       0.73      0.70      0.67        10
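
To see what class_weight='balanced' amounts to, the weights can be computed directly with compute_class_weight from sklearn.utils.class_weight. The sketch below uses made-up labels (6 High, 3 Medium, 1 Low), not the workshop training set, and checks the result against the formula n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up training labels for illustration
toy_y = np.array(["High"] * 6 + ["Medium"] * 3 + ["Low"] * 1)
classes = np.unique(toy_y)  # sorted: High, Low, Medium

weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_y)

# Same numbers by hand: n_samples / (n_classes * count of each class)
counts = np.array([(toy_y == c).sum() for c in classes])
manual = len(toy_y) / (len(classes) * counts)

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The rarest class receives the largest weight, so misclassifying one of its samples costs the model more during training.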



#Training and Evaluating Naïve Bayes Model
Train a Gaussian Naïve Bayes model on the dataset. Gaussian Naïve Bayes estimates a separate mean and variance for each feature within each class, so it does not require feature scaling and is trained here on the unscaled data. Predict on the test data and evaluate the model using Precision, Recall, F1-Score, and Accuracy.

In [9]:
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

print("Naive Bayes:")
print(classification_report(y_test, y_pred_nb))

Naive Bayes:
              precision    recall  f1-score   support

        High       1.00      1.00      1.00         5
         Low       0.00      0.00      0.00         1
      Medium       0.75      0.75      0.75         4

    accuracy                           0.80        10
   macro avg       0.58      0.58      0.58        10
weighted avg       0.80      0.80      0.80        10
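
The "macro avg" and "weighted avg" rows can be reproduced by hand from the per-class rows. The sketch below copies the per-class precision, recall, and support values from the Naive Bayes report above: the macro average is a plain mean over classes, while the weighted average weights each class by its support.

```python
import numpy as np

# Per-class values copied from the Naive Bayes report (order: High, Low, Medium)
support = np.array([5, 1, 4])
precision = np.array([1.00, 0.00, 0.75])
recall = np.array([1.00, 0.00, 0.75])

# Macro average: every class counts equally, regardless of size
macro_precision = precision.mean()

# Weighted average: each class counts in proportion to its support
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)

print(round(macro_precision, 2), round(weighted_precision, 2), round(weighted_recall, 2))
```

Because the single Low sample has little weight, the weighted averages hide the fact that the Low class is never predicted correctly, while the macro average exposes it.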



#Training and Evaluating K-Nearest Neighbors (KNN) Model
Train a K-Nearest Neighbors (KNN) model with k=5, which is also scikit-learn's default. KNN classifies a sample from the distances to its nearest training points, so the scaled features are used to keep any single large-valued feature from dominating those distances. Predict on the test data and evaluate the model using Precision, Recall, F1-Score, and Accuracy.

In [10]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train_scaled, y_train)

y_pred_knn = knn_model.predict(X_test_scaled)

print("K-Nearest Neighbors:")
print(classification_report(y_test, y_pred_knn))

K-Nearest Neighbors:
              precision    recall  f1-score   support

        High       1.00      0.60      0.75         5
         Low       0.00      0.00      0.00         1
      Medium       0.50      0.75      0.60         4

    accuracy                           0.60        10
   macro avg       0.50      0.45      0.45        10
weighted avg       0.70      0.60      0.61        10
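
A small numeric sketch of why scaling matters for KNN: with one feature in the hundreds of thousands and one binary flag, the raw Euclidean distance is almost entirely determined by the large feature. The three points below are made up for illustration, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up points: [spend-like value, binary flag]
pts = np.array([[100000.0, 0.0], [100500.0, 1.0], [120000.0, 0.0]])

def flag_share(a, b):
    """Fraction of the squared distance between a and b due to the flag column."""
    sq = (a - b) ** 2
    return sq[1] / sq.sum()

# Raw features: the flag difference is negligible next to the spend difference
flag_share_raw = flag_share(pts[0], pts[1])

# Standardized features: both columns are on a comparable scale
scaled = StandardScaler().fit_transform(pts)
flag_share_scaled = flag_share(scaled[0], scaled[1])

print(flag_share_raw, flag_share_scaled)
```

Without scaling, the one-hot State columns would have almost no influence on which neighbors are selected.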



#Training and Evaluating Decision Tree Model
Train a Decision Tree model with a maximum depth of 5 to limit overfitting and improve generalization. Tree splits depend only on the ordering of feature values, so the unscaled data is used. Predict on the test data and evaluate the model using Precision, Recall, F1-Score, and Accuracy.

In [11]:
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)

y_pred_dt = dt_model.predict(X_test)

print("Decision Tree:")
print(classification_report(y_test, y_pred_dt))

Decision Tree:
              precision    recall  f1-score   support

        High       0.83      1.00      0.91         5
         Low       0.50      1.00      0.67         1
      Medium       1.00      0.50      0.67         4

    accuracy                           0.80        10
   macro avg       0.78      0.83      0.75        10
weighted avg       0.87      0.80      0.79        10
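
One advantage of a decision tree is that its learned rules can be read directly. The sketch below fits a shallow tree on a made-up single-feature dataset (the feature name rd_spend_k and all values are hypothetical) and prints the rules with export_text from sklearn.tree; the same call could be applied to dt_model with the column names of X.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: one feature, three classes separated by two thresholds
toy_X = [[10], [20], [30], [60], [70], [80], [120], [130]]
toy_y = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High"]

# max_depth caps how many successive splits the tree may make
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(toy_X, toy_y)

# Human-readable if/else rules learned by the tree
print(export_text(toy_tree, feature_names=["rd_spend_k"]))

toy_pred = toy_tree.predict([[25], [65], [125]])
print(toy_pred.tolist())
```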



#Training and Evaluating Support Vector Classifier (SVC) Model
Train a Support Vector Classifier (SVC) using the radial basis function (RBF) kernel for non-linear classification. The RBF kernel depends on distances between samples, so the scaled features are used. Predict on the test data and evaluate the model using Precision, Recall, F1-Score, and Accuracy.

In [12]:
svc_model = SVC(kernel='rbf', random_state=42)
svc_model.fit(X_train_scaled, y_train)

y_pred_svc = svc_model.predict(X_test_scaled)

print("Support Vector Classifier:")
print(classification_report(y_test, y_pred_svc,zero_division=0))

Support Vector Classifier:
              precision    recall  f1-score   support

        High       1.00      0.60      0.75         5
         Low       0.00      0.00      0.00         1
      Medium       0.57      1.00      0.73         4

    accuracy                           0.70        10
   macro avg       0.52      0.53      0.49        10
weighted avg       0.73      0.70      0.67        10
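
As an alternative way to organize the scaling and the classifier (not how the cell above is written), the two steps can be combined with make_pipeline from sklearn.pipeline, so the scaler is always fitted on whatever data the pipeline is trained on. The sketch below uses two synthetic, well-separated clusters; the data and names are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: two well-separated clusters (not the workshop data)
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
toy_y = np.array(["Low"] * 20 + ["High"] * 20)

# Scaling and the RBF-kernel SVC are fitted together as one estimator
svc_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
svc_pipeline.fit(toy_X, toy_y)

# Raw (unscaled) inputs can be passed in; the pipeline scales them internally
toy_preds = svc_pipeline.predict([[0.0, 0.0], [6.0, 6.0]])
print(toy_preds.tolist())
```

A pipeline like this is also convenient with cross-validation, because the scaler is refitted inside each fold.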



#Comparing and Summarizing Evaluation Metrics
Aggregate the evaluation metrics (Accuracy, Precision, Recall, and F1-Score) for all models into a structured format. Generate a summary DataFrame to compare the performance of Logistic Regression, Naïve Bayes, KNN, Decision Tree, and Support Vector Classifier based on the weighted averages of the metrics.

In [13]:
results = {
    "Logistic Regression": classification_report(y_test, y_pred_logistic, output_dict=True,zero_division=0),
    "Naive Bayes": classification_report(y_test, y_pred_nb, output_dict=True,zero_division=0),
    "KNN": classification_report(y_test, y_pred_knn, output_dict=True,zero_division=0),
    "Decision Tree": classification_report(y_test, y_pred_dt, output_dict=True,zero_division=0),
    "SVC": classification_report(y_test, y_pred_svc, output_dict=True,zero_division=0),
}

result_df = pd.DataFrame({
    model: {
        "Accuracy": metrics["accuracy"],
        "Precision (Weighted)": metrics["weighted avg"]["precision"],
        "Recall (Weighted)": metrics["weighted avg"]["recall"],
        "F1-Score (Weighted)": metrics["weighted avg"]["f1-score"]
    } for model, metrics in results.items()
})

result_df

,Logistic Regression,Naive Bayes,KNN,Decision Tree,SVC
Accuracy,0.7,0.8,0.6,0.8,0.7
Precision (Weighted),0.728571,0.8,0.7,0.866667,0.728571
Recall (Weighted),0.7,0.8,0.6,0.8,0.7
F1-Score (Weighted),0.665909,0.8,0.615,0.787879,0.665909
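
To rank the models, the summary can be transposed so that each model is a row and then sorted by a chosen metric. The sketch below rebuilds the table from the values shown above (the name summary is hypothetical; in the notebook, result_df could be used directly) and sorts by weighted F1-Score.

```python
import pandas as pd

# Values copied from the summary table above
summary = pd.DataFrame(
    {
        "Logistic Regression": [0.7, 0.728571, 0.7, 0.665909],
        "Naive Bayes": [0.8, 0.8, 0.8, 0.8],
        "KNN": [0.6, 0.7, 0.6, 0.615],
        "Decision Tree": [0.8, 0.866667, 0.8, 0.787879],
        "SVC": [0.7, 0.728571, 0.7, 0.665909],
    },
    index=["Accuracy", "Precision (Weighted)", "Recall (Weighted)", "F1-Score (Weighted)"],
)

# One row per model, best weighted F1-Score first
ranked = summary.T.sort_values("F1-Score (Weighted)", ascending=False)
print(ranked)
```

With only 10 test samples, a difference of 0.1 in accuracy corresponds to a single prediction, so the ranking should be read with caution.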
