###Algorithm questions


1.How does regularization (L1 and L2) help in preventing overfitting?

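Ans: Regularization helps prevent overfitting by adding a penalty on the size of the model's weights to the loss function, which discourages the model from relying too heavily on specific features. The two common types are L1 regularization (Lasso), which penalizes the sum of the absolute weights, and L2 regularization (Ridge), which penalizes the sum of the squared weights.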



*   L1 can drive some weights to exactly zero, which simplifies the model by effectively eliminating irrelevant features.
*   L2 keeps all features but shrinks their weights toward zero, leading to a smoother, more generalized model.
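
The difference between the two penalties can be sketched in the special case of orthonormal features, an assumption made here only because both penalties then have a closed-form solution. For the objectives ½‖y − Xw‖² + λ‖w‖₁ (L1) and ½‖y − Xw‖² + (λ/2)‖w‖² (L2), each unregularized weight is soft-thresholded by L1 and rescaled by L2. The names `l1_shrink`, `l2_shrink`, `lam`, and the example weights are hypothetical.

```python
def l1_shrink(w, lam):
    """Lasso solution for one weight under orthonormal features (soft-thresholding)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def l2_shrink(w, lam):
    """Ridge solution for one weight under orthonormal features (uniform rescaling)."""
    return w / (1.0 + lam)


# Hypothetical unregularized (least-squares) weights: two strong, two weak.
ols_weights = [3.0, -2.0, 0.3, -0.1]
lam = 0.5

l1_weights = [l1_shrink(w, lam) for w in ols_weights]
l2_weights = [l2_shrink(w, lam) for w in ols_weights]

print("L1:", l1_weights)  # the two weak weights become exactly 0.0
print("L2:", l2_weights)  # every weight shrinks, but none becomes 0.0
```

With general (correlated) features there is no such closed form for L1, but the qualitative behavior is the same: L1 zeroes out weak weights, L2 only shrinks them.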




2.Why is feature scaling important in gradient descent?

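Ans: Gradient descent applies the same learning rate to every weight, so features on very different scales produce gradients of very different sizes. Feature scaling puts all features on a comparable scale, which prevents features with large magnitudes from dominating the updates, speeds up convergence, and reduces the risk of numerical problems such as overflow or divergence.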

Common Scaling Methods:

1. Min-Max Scaling: Makes features range from 0 to 1.

2. Standardization: Adjusts features to have a mean of 0 and a standard deviation of 1.
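
Both methods can be sketched with the standard library alone. The function names and the `salary` values below are hypothetical, and the sketch assumes the feature is not constant (otherwise the denominators would be zero). Like scikit-learn's `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and rescale values to mean 0 and (population) standard deviation 1."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical feature on a large scale (e.g., yearly salary).
salary = [30_000, 45_000, 60_000, 90_000]

print(min_max_scale(salary))
print(standardize(salary))
```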

###Problem Solving


1.Given a dataset with missing values, how would you handle them before training an ML model?

Ans: Handling missing values before training a machine learning model involves inspecting them first and then either dropping or imputing them:

        # Check for missing values
        print(data.isnull().sum())

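        # Remove rows with missing values
        data_cleaned_rows = data.dropna()

        # Remove columns with too many missing values
        # (example threshold: keep columns with at least 50% non-missing values)
        data_cleaned_cols = data.dropna(axis=1, thresh=int(0.5 * len(data)))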

        # Fill missing numerical values with the mean
        data['numerical_column'] = data['numerical_column'].fillna(data['numerical_column'].mean())

        # Fill missing categorical values with the mode
        data['categorical_column'] = data['categorical_column'].fillna(data['categorical_column'].mode()[0])

        # Forward fill for time-series or sequential data
        data['time_series_column'] = data['time_series_column'].ffill()

        # Backward fill any values still missing (e.g., at the start of the series)
        data['time_series_column'] = data['time_series_column'].bfill()

        # Verify no missing values remain
        print(data.isnull().sum())

2.Design a pipeline for building a classification model. Include steps for data preprocessing.

Ans: 1. Data Preprocessing

a. Data Loading
Load the dataset using pandas or another library.

b. Exploratory Data Analysis (EDA)
Check for missing values, data distribution, outliers, and class balance.
Use data.info(), data.describe(), and visualization libraries like matplotlib or seaborn.

c. Handle Missing Values
Impute missing values using mean, median, mode, or advanced techniques.

d. Encode Categorical Features
Convert categorical variables to numerical ones using:
Label (Ordinal) Encoding: For ordinal categories, where the order is meaningful.
One-Hot Encoding: For nominal categories, where there is no inherent order.

e. Feature Scaling
Standardize or normalize numerical features for scale-sensitive algorithms (e.g., Logistic Regression, SVM). Tree-based models such as Random Forest generally do not need scaling.

f. Feature Selection
Use correlation analysis or feature importance to select the most relevant features, or PCA to reduce dimensionality.

2. Model Building

a. Train-Test Split
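Split the data into training and test sets using train_test_split() from sklearn. Fit imputers, encoders, and scalers on the training set only, so that no information from the test set leaks into preprocessing.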

b. Model Selection
Choose a classification algorithm (e.g., Logistic Regression, Decision Tree, Random Forest, SVM, or Neural Network).

c. Model Training
Train the model using the training data.

d. Hyperparameter Tuning
Use Grid Search or Random Search to optimize model parameters.

3. Model Evaluation

a. Evaluate on Test Data
Use metrics like accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrix.

b. Cross-Validation
Perform k-fold cross-validation to assess model stability and generalization.
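
To make the metrics in step 3a concrete, the sketch below derives them by hand from the confusion-matrix counts of a binary problem. The label lists are made-up illustration data and `binary_metrics` is a hypothetical helper; in practice `sklearn.metrics` provides the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```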


In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Step 2: Load data
data = pd.read_csv('dataset.csv')

# Step 3: Define features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# Step 4: Data Preprocessing
# Identify numerical and categorical features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Numerical feature processing (impute missing values and scale)
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())])  # Standardize numerical data

# Categorical feature processing (impute missing values and encode)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])  # One-Hot Encoding for categorical features

# Combine the numeric and categorical transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Step 5: Create full pipeline (Preprocessing + Model)
clf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing steps
    ('classifier', RandomForestClassifier())])  # Classification model

# Step 6: Train-test split (stratify=y keeps the class proportions the same in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Step 7: Train the model
clf_pipeline.fit(X_train, y_train)

# Step 8: Make predictions on test set
y_pred = clf_pipeline.predict(X_test)

# Step 9: Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
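
The cell above does not cover hyperparameter tuning or cross-validation (steps 2d and 3b). A minimal sketch of both, assuming a synthetic dataset from `make_classification` as a stand-in for `dataset.csv` and a small, arbitrarily chosen parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dataset.csv (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))])

# Pipeline parameters are addressed as '<step name>__<parameter name>'.
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [3, None]}

# 5-fold cross-validation on the training set for every parameter combination.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validated F1:", round(search.best_score_, 3))
# With scoring='f1', search.score() also reports F1, here on the held-out test set.
print("Test F1:", round(search.score(X_test, y_test), 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the validation folds never influence preprocessing.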



###Coding

1.Write a Python script to implement a decision tree classifier using Scikit-learn.


In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 2: Load dataset and separate features from the target
# (assumes all feature columns are already numeric)
data = pd.read_csv('dataset.csv')

X = data.drop('target', axis=1)  # Features
y = data['target']  # Target

# Step 3: Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Create a Decision Tree model
model = DecisionTreeClassifier(random_state=42)

# Step 5: Train the model on the training data
model.fit(X_train, y_train)

# Step 6: Make predictions on the test data
y_pred = model.predict(X_test)

# Step 7: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 8: Visualizing the Decision Tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
# Feature names come from the DataFrame columns; class names from the fitted model
plot_tree(model, filled=True, feature_names=list(X.columns),
          class_names=[str(c) for c in model.classes_], rounded=True)
plt.show()
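
Since `dataset.csv` is a placeholder, the same workflow can be tried end to end on scikit-learn's built-in Iris dataset. This is a sketch under that substitution; `max_depth=3` is an arbitrary choice to keep the printed tree short, and `export_text` is used in place of a plot so no display is needed.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in dataset used as a stand-in for dataset.csv.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42)

# A shallow tree keeps the learned rules readable.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", round(accuracy, 3))

# Print the learned decision rules as indented text.
print(export_text(model, feature_names=list(iris.feature_names)))
```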


2.Given a dataset, write code to split the data into training and testing sets using an 80-20 split.


In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset
data = pd.read_csv('dataset.csv')

# Separate features (X) and target (y)
X = data.drop('target', axis=1)    #features
y = data['target']          #target

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
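
Conceptually, an 80-20 split shuffles the rows and cuts the shuffled order at the 80% mark. The standard-library sketch below illustrates that idea only; it is not how `train_test_split` is implemented, and `split_80_20` is a hypothetical helper.

```python
import random


def split_80_20(rows, seed=42):
    """Shuffle row indices with a fixed seed, then cut at the 80% mark."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    cut = len(rows) * 80 // 100
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(10))  # hypothetical dataset of 10 rows
train_rows, test_rows = split_80_20(rows)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, and reusing the same seed reproduces the same split, which is what `random_state=42` does in the cell above.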

###Case Study


A company wants to predict employee attrition. What kind of ML problem is this? Which algorithms would you choose and why?

Ans: Predicting employee attrition is a supervised classification problem, as the goal is to predict a categorical outcome (whether an employee will leave or stay) based on historical data. The target variable (attrition) is usually binary: 1 (employee leaves) or 0 (employee stays).

Algorithms to Use:

1. Logistic Regression:

Simple and effective for predicting binary outcomes.

Easy to interpret the results.

2. Decision Tree:

Good for understanding how features (like age, job satisfaction) affect attrition.

Can be visualized for better understanding.

3. Random Forest:

Combines multiple decision trees to improve accuracy.

Handles complex datasets well and reduces overfitting compared with a single decision tree.

Why These Algorithms?

Logistic Regression and Decision Trees are simple and interpretable.

Random Forest improves performance on more complex datasets.

Steps:

Clean the data (handle missing values, encode categories).

Split the data into training and testing sets.

Train and evaluate models.

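Use performance metrics like accuracy and F1-score to choose the best model. Because employees who leave are usually the minority class, accuracy alone can be misleading, so give more weight to F1-score (or recall).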
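
The comparison step can be sketched as follows. No real attrition data is assumed here: `make_classification` generates a synthetic, imbalanced stand-in in which roughly 20% of the samples belong to the "leaves" class, and the model settings are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an attrition dataset: label 1 = employee leaves (~20%).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.8, 0.2], random_state=42)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean 5-fold cross-validated F1-score for the "leaves" class.
f1_scores = {name: cross_val_score(model, X, y, cv=5, scoring='f1').mean()
             for name, model in models.items()}

for name, score in f1_scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

On real data the ranking may differ, and interpretability may matter as much as the score when the results are presented to HR.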