In [None]:
#1.How does regularization (L1 and L2) help in preventing overfitting?
L1 and L2 regularization are techniques used in machine learning to prevent overfitting.
Overfitting is a phenomenon where the model learns the noise in the data instead of the underlying patterns, which leads to poor performance on new, unseen data.

L1 regularization (Lasso) adds a penalty proportional to the **absolute value** of the weights. This can drive the weights of less important features to exactly zero,
effectively selecting only the most relevant ones. As a result, the model becomes simpler and less likely to overfit.

L2 regularization (Ridge) adds a penalty proportional to the **square** of the weights.
This shrinks all weights toward zero without eliminating any of them, which keeps the model from relying too heavily on any one feature and helps reduce overfitting.

Both techniques prevent overfitting by limiting model complexity: L1 by removing features, and L2 by keeping every weight small.
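
The difference between the two penalties can be seen in a small sketch. The data here is synthetic and the `alpha` values are arbitrary choices for illustration, not tuned recommendations; only the first two of ten features actually influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption for illustration): 10 features,
# but only the first two actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# alpha controls the penalty strength; these values are arbitrary.
lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that were set to exactly zero by each model.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros (Lasso):", n_zero_lasso)
print("Exact zeros (Ridge):", n_zero_ridge)
```

On data like this, Lasso typically zeroes out the irrelevant features, while Ridge keeps small non-zero weights on all of them.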

In [None]:
#2.Why is feature scaling important in gradient descent?
Feature scaling is important in gradient descent because it helps the algorithm learn faster and more efficiently by ensuring all features are on the same scale.

Faster Convergence: When features have very different scales, the loss surface is stretched along some directions, so a single learning rate is too large for some weights and too small for others. Scaling makes the updates more consistent and speeds up convergence.

Prevents Skewed Updates: Without scaling, features with larger values produce larger gradients and can dominate the learning process, making updates unstable. Scaling ensures each feature has a similar impact.

Improves Optimization: Scaling helps the algorithm reach a good solution in fewer iterations by avoiding issues caused by features with very different magnitudes.

Two common scaling methods are:
Min-Max Scaling: Rescales features to a fixed range (e.g., 0 to 1).
Standardization: Centers features around 0 with a standard deviation of 1.

In short, feature scaling helps gradient descent work more effectively and quickly by putting all features on a comparable scale.
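
As a minimal sketch of the two scaling methods, using Scikit-learn's `MinMaxScaler` and `StandardScaler`. The feature values (age in years, salary in dollars) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is salary in dollars.
X = np.array([
    [25, 40_000.0],
    [32, 65_000.0],
    [47, 120_000.0],
    [51, 90_000.0],
])

# Min-Max scaling: each column is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column gets mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print("Min-Max scaled:\n", np.round(X_minmax, 3))
print("Standardized:\n", np.round(X_std, 3))
```

In practice the scaler is usually fitted on the training set only and then applied to the test set with `transform`, so that no information from the test data leaks into training.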

In [None]:
# Problem Solving
#1.Given a dataset with missing values, how would you handle them before training an ML model?

If only a small fraction of the data is missing - Dropping the affected rows or imputing with the mean/median can work.
If a substantial fraction is missing - Imputation with more advanced methods (e.g., regression) or using tree-based models that handle missing values natively might be better.
If the data is not missing at random - Consider treating the missing values as a separate category or using a predictive model for imputation.
In short, the right method depends on the amount of missing data, the type of data (numerical or categorical), and the algorithm you're using.
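
Two of these options can be sketched as follows: median imputation for a numerical column and a separate category for a categorical one. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 38.0],
    "department": ["sales", "hr", np.nan, "sales", "it"],
})

# Numerical column: replace missing values with the column median.
num_imputer = SimpleImputer(strategy="median")
df["age"] = num_imputer.fit_transform(df[["age"]]).ravel()

# Categorical column: treat missing values as their own category.
df["department"] = df["department"].fillna("missing")

print(df)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed numerical columns.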

In [None]:
#2 Design a pipeline for building a classification model. Include steps for data preprocessing.

1 Import the needed libraries
2 Load the dataset - data = pd.read_csv('your_dataset.csv')
3 Preprocess the data - clean it and handle missing values (e.g., fill them with the mean for numerical columns or the mode for categorical columns)
4 Split the data into training and testing sets
5 Train the model using a classification algorithm (e.g., Random Forest)
6 Check the model's accuracy on the test data
This pipeline covers the essential steps to clean the data, train a classifier, and evaluate it in a simple and efficient way.
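
The steps above can be sketched with Scikit-learn's `Pipeline` and `ColumnTransformer`. The toy DataFrame stands in for `your_dataset.csv`, and the column names (`num_feature`, `cat_feature`, `target`) are hypothetical. One-hot encoding is an added assumption that is not in the list above; it is included because the classifier needs numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step 2: hypothetical toy data in place of pd.read_csv('your_dataset.csv').
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "num_feature": rng.normal(size=n),
    "cat_feature": rng.choice(["a", "b", "c"], size=n),
})
data["target"] = (data["num_feature"] > 0).astype(int)
# Remove some values so there is something to impute.
data.loc[rng.choice(n, size=20, replace=False), "num_feature"] = np.nan

X = data[["num_feature", "cat_feature"]]
y = data["target"]

# Step 3: mean for numerical columns, mode for categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["num_feature"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["cat_feature"]),
])

# Step 4: 80-20 split, stratified to keep class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 5: preprocessing and the classifier are fitted together on training data.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")
```

Wrapping the preprocessing in the pipeline means the imputation statistics are learned from the training split only, which avoids leaking test data into training.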

In [2]:
#Coding
#1.Write a Python script to implement a decision tree classifier using Scikit-learn.

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset (for demonstration, we use the Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Fit the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Accuracy: 1.00
Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [1]:
#2 Given a dataset, write code to split the data into training and testing sets using an 80-20 split.

import pandas as pd
from sklearn.model_selection import train_test_split

# Sample dataset creation
data = {
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    'label': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Splitting the dataset into training and testing sets
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# Displaying the results
print("Training Set:")
print(train_set)
print("\nTesting Set:")
print(test_set)


Training Set:
   feature1  feature2  label
5         6         5      1
0         1        10      0
7         8         3      1
2         3         8      0
9        10         1      1
4         5         6      0
3         4         7      1
6         7         4      0

Testing Set:
   feature1  feature2  label
8         9         2      0
1         2         9      1


In [None]:
# Case Study
# A company wants to predict employee attrition. What kind of ML problem is this? Which algorithms would you choose and why?

This is a binary classification problem. The objective is to predict whether an employee will leave the company (attrition = 1) or stay (attrition = 0)
based on various features (e.g., age, salary, department, job satisfaction, years at the company).

For predicting employee attrition, Logistic Regression is a good starting point because it's simple and its coefficients are easy to interpret. If the relationships between features and attrition are non-linear, Decision Trees are worth trying, since they can capture feature interactions, although a single tree can overfit if its depth is not limited.
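
A minimal sketch comparing the two algorithms. The data is synthetic: the features (`job_satisfaction`, `years_at_company`) and the assumed relationship between them and attrition are invented for illustration and do not come from any real company.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic employees (assumption): lower satisfaction and shorter tenure
# make attrition more likely.
rng = np.random.default_rng(0)
n = 500
job_satisfaction = rng.uniform(1, 5, size=n)
years_at_company = rng.uniform(0, 20, size=n)
logit = 2.0 - 1.0 * job_satisfaction - 0.1 * years_at_company
attrition = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([job_satisfaction, years_at_company])

X_train, X_test, y_train, y_test = train_test_split(
    X, attrition, test_size=0.2, random_state=42, stratify=attrition
)

# Logistic regression on standardized features, so coefficients are comparable.
log_reg = make_pipeline(StandardScaler(), LogisticRegression())
log_reg.fit(X_train, y_train)

# Shallow decision tree; max_depth=3 is an arbitrary limit to curb overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

coefs = log_reg.named_steps["logisticregression"].coef_[0]
acc_log_reg = log_reg.score(X_test, y_test)
acc_tree = tree.score(X_test, y_test)

print("Logistic regression coefficients:", np.round(coefs, 3))
print(f"Logistic regression accuracy: {acc_log_reg:.2f}")
print(f"Decision tree accuracy: {acc_tree:.2f}")
```

A negative coefficient means that higher values of that feature lower the predicted probability of attrition, which is the kind of direct reading that makes logistic regression easy to interpret. Because attrition datasets are often imbalanced, accuracy alone can be misleading, and precision and recall are worth checking as well.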