# Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

### Design a pipeline that includes the following steps:

*  Use an automated feature selection method to identify the important features in the dataset.
 
*  Create a numerical pipeline that includes the following steps:

   * Impute the missing values in the numerical columns using the mean of the column values.

   *  Scale the numerical columns using standardisation.

*  Create a categorical pipeline that includes the following steps:

   *  Impute the missing values in the categorical columns using the most frequent value of the column.

   *  One-hot encode the categorical columns.

*  Combine the numerical and categorical pipelines using a ColumnTransformer.

*  Use a Random Forest Classifier to build the final model.

*  Evaluate the accuracy of the model on the test dataset.

**Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.**

Dataset link : https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/

**Step 1**: Load the dataset and perform a preliminary analysis.

In [5]:
import pandas as pd

# Load the dataset
df = pd.read_csv("Attrition.csv")

# Display the first few rows of the dataset and its basic info
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [6]:
df.shape

(1470, 35)

In [7]:
# Checking dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [8]:
# Checking missing values in dataset 
df.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

### No missing values were found in the dataset

**Step 2**: Automated Feature Selection.

A brief overview:

1.  **Numerical Features**: Columns like 'Age', 'DailyRate', 'DistanceFromHome', etc.
2.  **Categorical Features**: Columns like 'BusinessTravel', 'Department', 'EducationField', etc.
3.  **Target Variable**: 'Attrition' (Yes or No), which is excluded from the feature set.

In [9]:
from sklearn.ensemble import RandomForestClassifier

In [10]:
# Convert the target variable 'Attrition' to binary (0 or 1)
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

In [11]:
# Splitting the dataset into features (X) and target variable (y)
X = df.drop('Attrition', axis=1)
y = df['Attrition']

In [12]:
# Quick encoding of categorical features for feature importance evaluation
X_encoded = pd.get_dummies(X, drop_first=True)

In [13]:
# Initialize and fit the RandomForestClassifier
# (no random_state is set, so the importances vary slightly between runs)
rf = RandomForestClassifier()
rf.fit(X_encoded, y)

In [14]:
# Getting feature importances
feature_importances = pd.Series(rf.feature_importances_, index=X_encoded.columns).sort_values(ascending=False)

feature_importances.head(10)  # Displaying the top 10 important features

MonthlyIncome        0.073584
Age                  0.056173
DailyRate            0.050601
TotalWorkingYears    0.048906
MonthlyRate          0.047312
OverTime_Yes         0.046191
HourlyRate           0.045562
EmployeeNumber       0.045040
DistanceFromHome     0.042096
YearsAtCompany       0.037223
dtype: float64

**The top 10 features, in descending order of importance as determined by the RandomForest model, are:**

1.    MonthlyIncome
2.    Age
3.    DailyRate
4.    TotalWorkingYears
5.    MonthlyRate
6.    OverTime (specifically when OverTime is "Yes")
7.    HourlyRate
8.    EmployeeNumber
9.    DistanceFromHome
10.    YearsAtCompany
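The ranking above only orders the features; it does not remove any. One possible way to turn importances into an actual selection step is scikit-learn's `SelectFromModel`. The sketch below is illustrative only: it uses synthetic stand-in data from `make_classification` rather than the attrition dataset, and the `"median"` threshold is an assumed choice, not something the original solution used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (hypothetical): 20 features, only 5 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2, random_state=42
)

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
X_selected = selector.fit_transform(X_demo, y_demo)

# Boolean mask marking which of the original columns were kept
selected_mask = selector.get_support()
print(X_demo.shape, "->", X_selected.shape)
```

Placed inside a `Pipeline` after the preprocessing step, a selector like this would be fitted on the training split only, which avoids using test rows to choose features.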

**Step 3**: Build the Numerical Pipeline.

For this step, we'll:

1.    Identify numerical columns.
2.    Impute missing values using the mean.
3.    Scale these values using standardization.

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Identify numerical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())                  # Standardize numerical features
])

numerical_pipeline

The numerical pipeline is set up to:

1.    Impute missing values using the mean of the respective columns.
2.    Standardize (or scale) the numerical columns.
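To see what these two steps do to actual values, here is a small sketch on a hypothetical single-column array with one missing entry (the names `toy_numeric` and `toy_numeric_pipeline` are made up for illustration). The missing value is replaced by the column mean, and the column is then rescaled to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column input with one missing value
toy_numeric = np.array([[1.0], [np.nan], [3.0]])

toy_numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of 1.0 and 3.0 = 2.0
    ("scaler", StandardScaler()),                 # subtract mean, divide by std
])

toy_scaled = toy_numeric_pipeline.fit_transform(toy_numeric)

# Mean learned by the imputer during fit
imputed_mean = toy_numeric_pipeline.named_steps["imputer"].statistics_[0]
print(imputed_mean, toy_scaled.ravel())
```

After imputation the column is `[1, 2, 3]`, so the imputed row sits exactly at the mean and maps to 0 after scaling.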

**Step 4**: Build the Categorical Pipeline.

For this step, we'll:

1.    Identify categorical columns.
2.    Impute missing values using the most frequent value.
3.    One-hot encode these values.

In [16]:
from sklearn.preprocessing import OneHotEncoder

# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('encoder', OneHotEncoder(drop='first'))               # One-hot encode categorical features
])

categorical_pipeline

The categorical pipeline is set up to:

1.    Impute missing values using the most frequent value of the respective columns.
2.    One-hot encode the categorical columns, dropping the first category to avoid collinearity.
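The effect of these two steps can be checked on a tiny example. The sketch below uses a hypothetical one-column frame (`toy_categorical`, with made-up department labels) containing one missing value; it is not taken from the attrition data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing value
toy_categorical = pd.DataFrame(
    {"department": ["Sales", "R&D", np.nan, "Sales"]}, dtype=object
)

toy_categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> "Sales"
    ("encoder", OneHotEncoder(drop="first")),              # drops the first category ("R&D")
])

# The encoder returns a sparse matrix by default; convert it for display
toy_encoded = toy_categorical_pipeline.fit_transform(toy_categorical).toarray()
print(toy_encoded.ravel())
```

With two categories and `drop='first'`, a single indicator column remains: 1 for "Sales" and 0 for "R&D". The imputed row becomes "Sales" because that is the most frequent value.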

**Step 5**: Combine the Numerical and Categorical Pipelines using **`ColumnTransformer`**.
    
This step applies each pipeline to its own set of columns and concatenates the results into a single feature matrix.

In [17]:
from sklearn.compose import ColumnTransformer

# Combining the numerical and categorical pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ])

preprocessor

The **`ColumnTransformer`** is now set up to process both numerical and categorical columns using their respective pipelines.
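As a quick sanity check of how the combined output is laid out, the following sketch runs a `ColumnTransformer` of the same shape on a hypothetical four-row frame (`toy_df`, with made-up `age` and `travel` columns). Setting `sparse_threshold=0` is an assumption made here only to force a dense array for printing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data with one missing value per column
toy_df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "travel": pd.Series(["Rarely", "Frequently", np.nan, "Rarely"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]), ["age"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("encoder", OneHotEncoder(drop="first"))]), ["travel"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_features = toy_preprocessor.fit_transform(toy_df)
print(toy_features.shape)  # one scaled numeric column + one indicator column
```

The numerical block comes first and the one-hot block second, following the order of the `transformers` list.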

**Step 6**: Build and Train the Model using **`RandomForestClassifier`**.

Now, we'll split our data into training and testing sets, integrate our preprocessing steps with the RandomForest classifier into one comprehensive pipeline, and then train the model.

In [18]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
# Combining preprocessing with the classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [20]:
# Train the model
pipeline.fit(X_train, y_train)

In [21]:
# Evaluate the training accuracy
train_accuracy = pipeline.score(X_train, y_train)
train_accuracy

1.0

The training accuracy of the model is 1.0 (or 100%). This indicates that the RandomForest classifier has perfectly fit the training data. However, this could be a sign of overfitting. We'll get a clearer picture once we evaluate the model on the test dataset.

**Step 7**: Evaluate the Model Accuracy on the Test Dataset.

Next, we evaluate the model's performance on the test dataset.

In [22]:
# Evaluate the test accuracy
test_accuracy = pipeline.score(X_test, y_test)
test_accuracy

0.8775510204081632

The model achieves an accuracy of approximately **87.76%** on the test dataset.

### Interpretation and Possible Improvements:

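**1. Interpretation**: The RandomForest classifier achieves nearly 88% accuracy on unseen data while scoring 100% on the training data. This gap of roughly 12 percentage points indicates that the model is overfitting the training data. Accuracy alone can also be misleading if the classes are imbalanced, so it should be compared against a majority-class baseline before concluding that the model performs well.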
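To illustrate why an accuracy figure benefits from a baseline, the sketch below compares a random forest with a majority-class `DummyClassifier` on synthetic imbalanced data. The data, the 85/15 class split, and all variable names are assumptions for illustration and are not derived from the attrition dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 85% negatives, 15% positives
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
baseline_bal = balanced_accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
forest_bal = balanced_accuracy_score(y_te, forest.predict(X_te))

print("baseline:", round(baseline_acc, 3), round(baseline_bal, 3))
print("forest:  ", round(forest_acc, 3), round(forest_bal, 3))
print(confusion_matrix(y_te, forest.predict(X_te)))
```

The baseline reaches a high plain accuracy without learning anything, while its balanced accuracy is 0.5. Reporting both metrics, plus the confusion matrix, shows how much of a model's score comes from the minority class.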

**2. Possible Improvements:**

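*    **Feature Selection**: While we used an automated feature importance method to identify key features, we didn't actually drop any features. Dropping low-importance features, identifier-like columns such as EmployeeNumber, and columns that appear constant in the preview (EmployeeCount, StandardHours) might improve model generalization. The importances were also computed on the full dataset before the train/test split, so a rigorous selection step should be fitted on the training data only.

*    **Hyperparameter Tuning**: We can use techniques like grid search or random search to tune the hyperparameters of the RandomForest classifier.

*    **Regularization**: Constraining the trees (for example with `max_depth`, `min_samples_leaf`, or `max_features`) can help reduce overfitting.

*    **Ensemble Methods**: Combining the predictions of several different model types (for example with voting or stacking) can lead to better performance and more robust results.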

*    **Alternative Models**: While RandomForest is a powerful tool, other models like Gradient Boosting Machines (e.g., XGBoost, LightGBM) or Neural Networks might offer improved performance on this dataset.
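The hyperparameter tuning and regularization suggestions can be combined in a grid search over tree-constraining parameters. The following is a minimal sketch on synthetic stand-in data; the grid values, `cv=3`, and the reduced `n_estimators=50` are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical)
X_tune, y_tune = make_classification(n_samples=300, n_features=10, random_state=42)

tuning_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "classifier__max_depth": [3, 6, None],
    "classifier__min_samples_leaf": [1, 5],
}

search = GridSearchCV(tuning_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_tune, y_tune)

print(search.best_params_, round(search.best_score_, 3))
```

Because the whole pipeline is passed to `GridSearchCV`, the preprocessing is refitted inside each cross-validation fold, so no fold sees statistics from its own validation rows.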

# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

**Step 1**: Load the Iris dataset.

In [23]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Splitting the data into training and testing sets (80% train, 20% test)
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

X_train_iris[:5], y_train_iris[:5]  # Displaying the first 5 rows of the training dataset

(array([[4.6, 3.6, 1. , 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [6.7, 3.1, 4.4, 1.4],
        [4.8, 3.4, 1.6, 0.2],
        [4.4, 3.2, 1.3, 0.2]]),
 array([0, 0, 1, 0, 0]))

**Step 2**: Build individual pipelines for:

*    Random Forest Classifier
*    Logistic Regression Classifier

We'll first construct these pipelines and then proceed to the Voting Classifier.

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [25]:
# Pipeline for Random Forest Classifier
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),           # Standardize features
    ('classifier', RandomForestClassifier(random_state=42))
])

In [26]:
# Pipeline for Logistic Regression Classifier
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),           # Standardize features
    ('classifier', LogisticRegression(random_state=42))
])

rf_pipeline, lr_pipeline

(Pipeline(steps=[('scaler', StandardScaler()),
                 ('classifier', RandomForestClassifier(random_state=42))]),
 Pipeline(steps=[('scaler', StandardScaler()),
                 ('classifier', LogisticRegression(random_state=42))]))

**Step 3**: Now, combine the predictions from both classifiers using a Voting Classifier. We'll use a "soft" voting strategy, where the predicted probabilities from the classifiers are averaged, and the class with the highest average probability is chosen as the prediction. Soft voting can outperform "hard" voting, which simply chooses the mode of the predicted labels, when the classifiers produce reasonably well-calibrated probabilities. With only two classifiers, soft voting also gives a more meaningful way to resolve disagreements than a tie between two votes.

In [27]:
from sklearn.ensemble import VotingClassifier

# Creating the voting classifier with both RandomForest and LogisticRegression
voting_clf = VotingClassifier(
    estimators=[
        ('rf', rf_pipeline),
        ('lr', lr_pipeline)
    ],
    voting='soft'
)

voting_clf

The Voting Classifier has been set up with both the Random Forest and Logistic Regression pipelines, using a "soft" voting strategy.
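The difference between the two voting strategies can be worked through by hand. The sketch below uses made-up probabilities for a single sample from three hypothetical classifiers (three rather than two, so that hard voting has a clear majority) and computes both outcomes with NumPy.

```python
import numpy as np

# Hypothetical class probabilities for ONE sample from three classifiers
# (rows: classifiers, columns: classes 0 and 1)
probabilities = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.45, 0.55],
])

# Hard voting: each classifier votes for its most likely class; the majority wins
votes = probabilities.argmax(axis=1)
hard_prediction = int(np.bincount(votes).argmax())

# Soft voting: average the probabilities, then pick the most likely class
mean_probabilities = probabilities.mean(axis=0)
soft_prediction = int(mean_probabilities.argmax())

print("hard:", hard_prediction, "soft:", soft_prediction)  # hard: 1 soft: 0
```

Two classifiers narrowly prefer class 1, so hard voting returns 1. The first classifier is very confident in class 0, which pulls the averaged probabilities to about 0.58 versus 0.42, so soft voting returns 0.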

**Step 4**: Train the combined pipeline on the Iris dataset.

In [28]:
# Train the voting classifier on the training data
voting_clf.fit(X_train_iris, y_train_iris)

# Evaluate the training accuracy of the combined model
train_accuracy_iris = voting_clf.score(X_train_iris, y_train_iris)
train_accuracy_iris

1.0

The combined model achieves a training accuracy of 1.0 (or 100%) on the Iris dataset.

**Step 5**: Now evaluate the accuracy of the combined model on the test dataset.

In [29]:
# Evaluate the test accuracy of the combined model
test_accuracy_iris = voting_clf.score(X_test_iris, y_test_iris)
test_accuracy_iris

1.0

The combined model also achieves an accuracy of 1.0 (or 100%) on the test dataset.

**Interpretation:**

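The Iris dataset is relatively simple, and it's not uncommon for models to achieve high accuracy on it. The Voting Classifier, which combines the predictions of the Random Forest and Logistic Regression models, successfully classifies all test instances. Note that the test set contains only 30 samples (20% of 150), so a perfect score on a single split is a coarse estimate of how well the model generalizes.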
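A single train/test split on a dataset this small gives a noisy estimate, so cross-validation is one way to get a steadier figure. The sketch below rebuilds a similar soft-voting ensemble and scores it with 5-fold cross-validation. The choice of `cv=5`, `max_iter=1000`, and the variable names are assumptions for illustration; the random forest is left unscaled here because tree-based models do not depend on feature scale.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

cv_voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", Pipeline([
            ("scaler", StandardScaler()),
            ("classifier", LogisticRegression(max_iter=1000, random_state=42)),
        ])),
    ],
    voting="soft",
)

# For classifiers, an integer cv uses stratified folds
cv_scores = cross_val_score(cv_voting_clf, X_cv, y_cv, cv=5)
print("mean accuracy:", cv_scores.mean(), "std:", cv_scores.std())
```

The spread of the five fold scores gives a rough sense of how much the accuracy depends on which samples end up in the held-out fold.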