# Pipeline

In machine learning, a pipeline is a sequence of data processing steps in which the output of one step is the input to the next. Pipelines streamline data preparation, model training, and prediction, and they ensure that every step is applied consistently and reproducibly, which makes workflows more robust and maintainable.

## Reference Links :
- https://youtu.be/XvxptZMUo2o?feature=shared
- https://youtu.be/TLjMibCN6v4?feature=shared

### Here’s a detailed breakdown of what a pipeline typically involves:

### 1. Data Preprocessing:
- Data Cleaning: Handling missing values, removing duplicates, and correcting errors in the data.
- Data Transformation: Scaling features, encoding categorical variables, and applying mathematical transformations.
- Feature Extraction: Deriving new features from raw data, such as extracting date components from timestamps or creating interaction terms.

### 2. Feature Selection:
- Identifying and selecting the most relevant features that contribute to the predictive power of the model, potentially reducing dimensionality and improving model performance.

### 3. Model Training:
- Model Selection: Choosing the appropriate machine learning algorithm(s) based on the problem type (e.g., classification, regression).
- Training: Fitting the selected model(s) to the training data to learn the underlying patterns and relationships.

### 4. Model Evaluation:
- Assessing the performance of the trained model using appropriate metrics (e.g., accuracy, precision, recall, RMSE) on a validation dataset to ensure it generalizes well to new, unseen data.

### 5. Model Tuning:
- Optimizing model hyperparameters using techniques like grid search, random search, or Bayesian optimization to enhance performance.

### 6. Prediction:
- Applying the trained and tuned model to new data to make predictions or classify new instances.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
X = ...  # Features
y = ...  # Target

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the pipeline
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),      # Handle missing values
    ('scaler', StandardScaler()),                     # Scale features
    ('classifier', RandomForestClassifier())          # Train a classifier
])

# Training the model
pipeline.fit(X_train, y_train)

# Making predictions
y_pred = pipeline.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
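
The tuning step (step 5) can be applied to the whole pipeline rather than to the model alone. The following is a minimal sketch, assuming synthetic data from `make_classification` with a few values blanked out; the data, the parameter grid, and the variable names `search` and `test_accuracy` are illustrative choices, not part of the original example. Hyperparameters of a step are addressed as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 rows, 6 features, ~5% missing values
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Step parameters are addressed as '<step name>__<parameter>'
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [3, None],
}

# Each candidate is cross-validated with the preprocessing refitted per fold
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f'Test accuracy: {test_accuracy:.3f}')
```

Because the imputer and scaler sit inside the pipeline, they are refitted on the training part of every fold, so no statistics from the validation part leak into preprocessing.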


## Example Using Titanic Dataset

In [25]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.tree import DecisionTreeClassifier

In [2]:
df = pd.read_csv('titanic.csv')
df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


In [3]:
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True) # feature selection

In [4]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Survived']),
    df['Survived'],
    test_size=0.2,
    random_state=42
)

In [5]:
X_train.head()

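| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5 | S |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0 | S |
| 382 | 3 | male | 32.0 | 0 | 0 | 7.925 | S |
| 704 | 3 | male | 26.0 | 1 | 0 | 7.8542 | S |
| 813 | 3 | female | 6.0 | 4 | 2 | 31.275 | S |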


In [6]:
y_train.head()

331    0
733    0
382    0
704    0
813    0
Name: Survived, dtype: int64

In [7]:
# imputation transformer
trf1 = ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
],remainder='passthrough')

ColumnTransformer: This is used to apply different preprocessing steps to different columns of a dataset.

Transformers inside ColumnTransformer:

- ('impute_age', SimpleImputer(), [2]): applies a SimpleImputer to the column at index 2, which is Age in X_train. By default, SimpleImputer fills missing values with the column mean.
- ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]): applies another SimpleImputer to the column at index 6, which is Embarked in X_train. With strategy='most_frequent', missing values are filled with the most frequent value in that column.
- remainder='passthrough': all columns not listed in the transformers are passed through unchanged. They are appended after the transformed columns, so the output column order differs from the input order.

In [9]:
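# Create a ColumnTransformer for one-hot encoding.
# It encodes the columns at positions 1 and 6 of trf1's output.
# Caution: trf1 outputs the columns as [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare],
# so [1, 6] selects Embarked and Fare here; Sex and Embarked would be [1, 3].
# The indices are kept as in the recorded run so that the outputs shown later still match.
trf2 = ColumnTransformer([
    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 6])
], remainder='passthrough')  # Leave all other columns unchanged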


In [10]:
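# Scaling: apply MinMaxScaler to the first 10 columns.
# remainder defaults to 'drop', so any columns beyond those 10 are discarded.
trf3 = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0, 10))
])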

In [11]:
# Define the model (it is trained later by pipe.fit)
trf4 = DecisionTreeClassifier()

In [12]:
pipe = Pipeline([
    ('trf1', trf1),
    ('trf2', trf2),
    ('trf3', trf3),
    ('trf4', trf4)
])

- This code builds a scikit-learn Pipeline that chains three transformers (trf1 to trf3) and a final estimator (trf4).
- Calling fit runs the steps in order: each transformer is fitted and its output becomes the input to the next step, and the last step fits the model on the fully transformed data. Calling predict applies the same fitted transformations before the model predicts.
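
Positional indices are easy to get wrong when ColumnTransformers are chained, because each one outputs its transformed columns first and the passthrough columns after them, so positions shift between steps. A minimal sketch of one alternative, assuming the input stays a pandas DataFrame, is to select columns by name inside a single ColumnTransformer. The tiny `X_demo` frame below is made up (it is not real Titanic data), and the names `numeric_cols`, `categorical_cols`, `preprocess`, and `pipe_by_name` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up rows with the same columns as X_train (illustration only)
X_demo = pd.DataFrame({
    'Pclass': [1, 3, 2, 3, 1, 2],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Age': [45.5, np.nan, 32.0, 26.0, 6.0, 40.0],
    'SibSp': [0, 1, 0, 1, 4, 0],
    'Parch': [0, 0, 0, 0, 2, 0],
    'Fare': [28.5, 13.0, 7.925, 7.8542, 31.275, 10.5],
    'Embarked': ['S', 'C', np.nan, 'S', 'Q', 'S'],
})
y_demo = pd.Series([0, 1, 1, 0, 1, 0])

numeric_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
categorical_cols = ['Sex', 'Embarked']

# Numeric columns: fill missing values with the mean, then scale to [0, 1]
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])

# Columns are selected by name, so their position never matters
preprocess = ColumnTransformer([
    ('num', numeric_steps, numeric_cols),
    ('cat', categorical_steps, categorical_cols),
])

pipe_by_name = Pipeline([
    ('preprocess', preprocess),
    ('model', DecisionTreeClassifier(random_state=42)),
])

pipe_by_name.fit(X_demo, y_demo)
preds = pipe_by_name.predict(X_demo)
n_features_out = pipe_by_name.named_steps['preprocess'].transform(X_demo).shape[1]
print(preds, n_features_out)
```

Nesting a small Pipeline per column group keeps imputation, encoding, and scaling for each group together, at the cost of requiring DataFrame input at prediction time.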

In [13]:
pipe

## Pipeline Vs make_pipeline

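- Pipeline requires an explicit name for each step; make_pipeline generates the names automatically from the lowercased class names.
- Use Pipeline when step names matter, for example when referring to steps in a hyperparameter search or inspecting them later; use make_pipeline for simpler pipelines where the generated names are sufficient.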
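
A short sketch of the naming difference, using a two-step pipeline chosen only for illustration (the variables `named` and `auto` are not part of the original notebook):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Explicit step names
named = Pipeline([
    ('scale', StandardScaler()),
    ('tree', DecisionTreeClassifier()),
])

# Names generated from the lowercased class names
auto = make_pipeline(StandardScaler(), DecisionTreeClassifier())

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]
print(named_steps)  # ['scale', 'tree']
print(auto_steps)   # ['standardscaler', 'decisiontreeclassifier']
```

When the same class appears more than once, make_pipeline appends a numeric suffix to keep the names unique.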

In [15]:
# Alternate Syntax
pipe = make_pipeline(trf1,trf2,trf3,trf4)

In [16]:
pipe

In [17]:
pipe.fit(X_train,y_train)

In [18]:
# Predict
y_pred = pipe.predict(X_test)

In [19]:
y_pred

array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6256983240223464

In [21]:
# Export the fitted pipeline
import pickle
with open('pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)

In [22]:
with open('pipe.pkl', 'rb') as f:
    pipe = pickle.load(f)

In [23]:
# Simulated user input: one passenger in the same column order as X_train
# (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked), reshaped to (1, 7)
# because predict expects a 2D array of shape (n_samples, n_features).
test_input2 = np.array([2, 'male', 31.0, 0, 0, 10.5, 'S'], dtype=object).reshape(1, 7)

In [26]:
# The pipeline applies every fitted preprocessing step to this input before the model predicts.
pipe.predict(test_input2)

array([0], dtype=int64)

# ML Interview Questions

### 1. What are the types of ML algorithms?
- Supervised learning
- Unsupervised learning
- Reinforcement learning


### 2. What is the difference between supervised and unsupervised algorithms?
Supervised learning uses labelled data to train algorithms that make predictions or classify data based on learned patterns, while unsupervised learning works with unlabeled data to discover inherent structures and relationships within the data without predefined target values.

- Examples of supervised learning algorithms: linear regression, decision trees, support vector machines, and neural networks.
- Examples of unsupervised learning algorithms: k-means clustering, hierarchical clustering, principal component analysis (PCA), and autoencoders.

### 3. Measures of central tendency and their business use cases
Measures of central tendency, such as mean, median, and mode, are statistical tools used to summarise the central or typical value of a dataset. Here's a brief explanation of each measure and its business use case:

- Mean:
The mean is the average value of a dataset and is calculated by summing all values and dividing by the number of observations.
Business Use Case: The mean is widely used in business for various purposes, such as calculating average sales, average salary, or average customer satisfaction rating. It provides a straightforward measure of the typical value and helps in making decisions related to resource allocation, budgeting, and performance evaluation.

- Median:
The median is the middle value of a sorted dataset. If the dataset has an odd number of observations, the median is the middle value. If the dataset has an even number of observations, the median is the average of the two middle values.
Business Use Case: The median is robust to outliers and skewed distributions, making it useful for analysing skewed data, such as income distribution, housing prices, or project completion times. It provides insight into the central tendency of the dataset while being less affected by extreme values.

- Mode:
The mode is the value that appears most frequently in a dataset.
Business Use Case: The mode is useful for categorical or nominal data, such as product categories, customer preferences, or types of defects in manufacturing. Identifying the mode helps in understanding the most common or popular category and can inform marketing strategies, inventory management, or quality control efforts.

In summary, measures of central tendency provide valuable insights into the typical values of a dataset and are essential tools for making informed business decisions across various domains.
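
The three measures can be computed with Python's standard `statistics` module. This is a small sketch with made-up numbers: the `salaries` list contains one extreme value to show how the mean is pulled upwards while the median is not, and `colors` stands in for a categorical field.

```python
import statistics

# Made-up salaries (in thousands) with one outlier
salaries = [30, 32, 35, 38, 40, 250]
# Made-up categorical data, e.g. the colour chosen by each customer
colors = ['red', 'blue', 'red', 'green', 'red', 'blue']

mean_salary = statistics.mean(salaries)      # pulled up by the outlier
median_salary = statistics.median(salaries)  # average of the two middle values
mode_color = statistics.mode(colors)         # most frequent category

print(round(mean_salary, 2))  # 70.83
print(median_salary)          # 36.5
print(mode_color)             # red
```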

### 4. What is the bias-variance trade-off?
The bias-variance trade-off is the balance between two sources of error: bias, the error introduced by oversimplifying a model (underfitting), and variance, the error introduced by the model's sensitivity to the particular training sample, including its noise (overfitting). The trade-off arises because reducing bias, typically by making the model more flexible, often increases variance, and vice versa. Achieving a good balance between the two is essential for building models that generalize well to new, unseen data.
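
For squared-error loss, the trade-off can be written as a decomposition of the expected prediction error. The sketch below assumes the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that the expectation is taken over training sets and noise; the symbols $f$, $\hat{f}$, and $\sigma^2$ are notation introduced here, not in the original text.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why lowering one at the expense of the other can still reduce the total error.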


### 5. Which optimization technique is used in linear regression?
Linear regression finds the coefficients that minimise a cost function, usually the mean squared error (MSE) between the predicted and actual target values. Ordinary least squares can be solved in closed form (the normal equation), so an iterative optimiser is not strictly required. Gradient descent is the commonly cited alternative: an iterative algorithm that repeatedly updates the coefficients in the direction that reduces the cost, which is useful when the dataset is too large for the closed-form solution to be practical.
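
A minimal NumPy sketch of gradient descent on the MSE, compared with the closed-form least-squares solution. The synthetic data (true slope 3, intercept 2), the learning rate, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (assumption): y = 3x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=100)

# Add a column of ones so the intercept is learned as a coefficient
Xb = np.column_stack([np.ones(len(X)), X])

# Closed-form least-squares solution for reference
w_ols, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Gradient descent on MSE = mean((Xb @ w - y) ** 2)
w_gd = np.zeros(2)
learning_rate = 0.1
for _ in range(2000):
    gradient = (2 / len(y)) * Xb.T @ (Xb @ w_gd - y)
    w_gd -= learning_rate * gradient

print('least squares   :', w_ols)
print('gradient descent:', w_gd)
```

Both approaches minimise the same cost, so with enough iterations and a suitable learning rate the two coefficient vectors agree.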


### 6. What do you mean by a cost function?
Cost functions, also known as objective functions or loss functions, measure the performance of a machine learning algorithm by quantifying the difference between predicted and actual values.

Here are some common cost functions used in different algorithms:
- Linear Regression:
  - Mean Squared Error (MSE)
  - Mean Absolute Error (MAE)
- Logistic Regression:
  - Binary Cross-Entropy Loss (Log Loss)
- Support Vector Machines (SVM):
  - Hinge Loss (used for binary classification)
- Neural Networks:
  - Mean Squared Error (MSE), for regression tasks
  - Binary Cross-Entropy Loss (Log Loss), for binary classification tasks
  - Categorical Cross-Entropy Loss (Softmax Loss), for multiclass classification tasks
- k-Means:
  - Within-Cluster Sum of Squares (WCSS)
- Decision Trees and Random Forests (criteria used to choose splits):
  - Gini Impurity
  - Entropy
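
Several of these cost functions can be written in a few lines of NumPy. This is a sketch for illustration, not a library implementation: the function names are hypothetical, labels for the hinge loss are assumed to be -1 or +1, and probabilities are clipped by a small `eps` to avoid taking the log of zero.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for labels in {0, 1} and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def hinge(y_true, scores):
    """Hinge loss for labels in {-1, +1} and raw decision scores."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - y * s)))

print(mse([1, 2, 3], [1, 2, 5]))                   # 4/3
print(mae([1, 2, 3], [1, 2, 5]))                   # 2/3
print(binary_cross_entropy([1, 0], [0.5, 0.5]))    # ln 2
print(hinge([1, -1], [2.0, 0.5]))                  # 0.75
```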

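### 7. Which technique is used in random forest?
Ensemble learning. More specifically, a random forest uses bagging: it trains many decision trees on bootstrap samples of the data, considers a random subset of features at each split, and combines the trees' predictions by voting or averaging.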

### 8. What is ensemble learning?
Ensemble learning is a technique where multiple models are combined to make predictions. By leveraging the strengths of diverse models, ensemble methods often outperform individual models, leading to more accurate and robust predictions.
Common ensemble methods include bagging, boosting, and stacking.


### 9. What is k in k-means?
In k-means clustering, k refers to the number of clusters that the algorithm aims to partition the data into. The value of k is pre-specified by the user and represents the number of centroids (cluster centres) that the algorithm will seek to identify in the data. Each data point is assigned to the nearest centroid, and the goal is to minimise the distance between data points and the centroid of their assigned cluster. Therefore, the choice of k directly influences the structure and granularity of the resulting clusters.


### 10. How to choose the best k value in k-means?

1. Elbow Method:
- Plot the within-cluster sum of squares (WCSS) against the number of clusters (k).
- Look for the "elbow" point, where the rate of decrease in WCSS slows down.
- This point suggests a good k: beyond it, adding more clusters reduces WCSS only marginally.

2. Silhouette Score:
- Calculate the silhouette score for different values of k.
- Choose the k value that maximises the silhouette score.
- A higher silhouette score indicates better-defined clusters and better separation between clusters.

3. Domain Knowledge:
- Consider the context and domain-specific insights.
- Choose k that aligns with expected or meaningful groupings in the data.
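
The first two methods can be sketched with scikit-learn as follows. The data are synthetic: three well-separated blobs generated with `make_blobs` at hand-picked centres, so the "right" answer is known in advance. The range of k values and the variable names are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three well-separated clusters
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

wcss = {}        # within-cluster sum of squares per k (for the elbow plot)
silhouette = {}  # silhouette score per k

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_  # inertia_ is the WCSS of the fitted model
    silhouette[k] = silhouette_score(X, km.labels_)

best_k = max(silhouette, key=silhouette.get)

for k in wcss:
    print(k, round(wcss[k], 1), round(silhouette[k], 3))
print('k with the highest silhouette score:', best_k)
```

Plotting `wcss` against k gives the elbow curve; on real data the elbow is often less sharp than on blobs like these, which is where the silhouette score and domain knowledge help.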

### 11. What is NLP?
NLP, or Natural Language Processing, is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. It involves the development of algorithms and models to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful for various applications, such as text analysis, language translation, chatbots, and sentiment analysis.


### 12. What is lemmatization?
Lemmatization is an NLP technique that reduces words to their base or dictionary form, known as the "lemma", by taking each word's context and intended meaning into account, which improves text processing accuracy.

- The words "running," "ran," and "runs" would all be reduced to the lemma "run."
- The words "better" and "best" would be reduced to the lemma "good."


### 13. What do you mean by stop words?
Stop words are common words in a language that are often filtered out during text preprocessing in natural language processing (NLP) tasks. These words typically do not carry significant meaning or contribute to the understanding of the text. Examples of stop words in English include "the", "is", "and", "of", "in", etc. Removing stop words can help reduce the dimensionality of the data, improve computational efficiency, and focus on the more informative words in the text. However, the list of stop words may vary depending on the context and specific NLP task.
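
A minimal sketch of stop-word removal in plain Python. The `STOP_WORDS` set below is a tiny hand-written list for illustration, not the list shipped with any library, and splitting on whitespace stands in for a real tokenizer.

```python
# Tiny illustrative stop-word list (assumption; real lists are much longer)
STOP_WORDS = {'the', 'is', 'and', 'of', 'in', 'a', 'to'}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stop_words('The cat is in the garden and the dog is asleep')
print(filtered)  # ['cat', 'garden', 'dog', 'asleep']
```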


### 14. What do you mean by NLTK?
NLTK stands for Natural Language Toolkit. It's a comprehensive Python library for natural language processing (NLP) tasks. NLTK provides tools, algorithms, and corpora to work with human language data. It's widely used for tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and semantic reasoning. Additionally, NLTK includes various datasets and resources for training and testing NLP models. It's a popular choice for researchers, educators, and practitioners working in the field of natural language processing.


### 15. How did you rectify the curse of dimensionality problem?
The curse of dimensionality refers to the challenges and limitations that arise when working with high-dimensional data.

Here are some common strategies to address or mitigate the curse of dimensionality:
1. Feature Selection:
- Identify and select the features that contribute most to the predictive performance of the model.

2. Feature Extraction:
- Use dimensionality reduction techniques to transform high-dimensional data into a lower-dimensional space while preserving most of the relevant information.
- Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are popular dimensionality reduction methods; t-SNE is mostly used for visualisation.

3. Regularization:
- Apply regularization techniques to penalize large coefficients and prevent overfitting in high-dimensional spaces.
- L1 (Lasso) regularization encourages sparsity by driving some coefficients to exactly zero, while L2 (Ridge) regularization shrinks coefficients towards zero without eliminating them.
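
As a sketch of feature extraction, the example below applies PCA to synthetic data that has 10 observed features but only 2 underlying directions of variation. The data-generating setup (`latent`, `mixing`, the noise level) is an assumption made so that the effect is easy to see.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 10 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Project the 10-dimensional data onto its 2 leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(X.shape, '->', X_reduced.shape)
print(f'variance retained: {explained:.4f}')
```

On real data the retained variance is usually lower, and the number of components is chosen by inspecting `explained_variance_ratio_`.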



### 16. Bayes' theorem
Bayes' theorem is a fundamental concept in probability theory, and it plays a crucial role in machine learning, particularly in probabilistic models and Bayesian inference. Bayes' theorem is a mathematical formula that describes the probability of an event based on prior knowledge or information. 

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

- $P(A \mid B)$ is the conditional probability of event A given event B.
- $P(B \mid A)$ is the conditional probability of event B given event A.
- $P(A)$ and $P(B)$ are the probabilities of events A and B, respectively.

In machine learning, Bayes' theorem is often used in the context of classification problems, particularly in Bayesian classifiers. These classifiers calculate the probability that a given data point belongs to a certain class based on the observed features of the data. Bayes' theorem helps update the probability of each class based on the observed evidence (features) and prior knowledge about the class distributions.
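
A small worked example of the formula, with made-up numbers: a condition with a 1% prior probability, a test that detects it 95% of the time, and a 5% false positive rate. The function name `posterior` and its parameters are hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) given P(A), P(B | A), and P(B | not A)."""
    # P(B) by the law of total probability
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# P(condition | positive test) with the made-up rates above
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with an accurate test, the posterior stays low because the prior is small, which is the kind of update from prior to evidence that Bayesian classifiers perform for every class.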