Q1. **What is feature engineering, and how does it work? Explain the various aspects of feature engineering in depth**

**Feature engineering** is the process of designing and creating new features from existing data in order to improve the performance of a machine learning model. It is a critical step in the ***machine learning process***, as the quality and relevance of the features can have a significant impact on the model's ability to learn and make accurate predictions.

_There are several aspects of feature engineering that can be considered:_

`Data preprocessing`: This involves cleaning and preparing the raw data for analysis, such as filling in missing values, handling outliers, and removing irrelevant or redundant features.

`Feature transformation`: This involves applying mathematical transformations to the features of the dataset in order to change their scale or distribution. Some common types of feature transformations include standardization, normalization, and log transformation.

`Feature construction`: This involves creating new features from existing data by combining, aggregating, or summarizing existing features in a meaningful way. For example, a new feature could be created by taking the mean of a group of features, or by combining two features using a mathematical operation such as addition or multiplication.

`Feature selection`: This involves selecting a subset of relevant features from the dataset to use in the model. The goal of feature selection is to reduce the dimensionality of the dataset and remove irrelevant or redundant features that do not contribute to the prediction task.

`Feature extraction`: This involves using techniques such as dimensionality reduction or feature decomposition to extract important features from the dataset. These techniques can help to identify the underlying structure of the data and extract the most relevant features for the prediction task.

Overall, the goal of feature engineering is to improve the performance of the machine learning model by selecting and constructing relevant and informative features that capture the patterns and relationships in the data.
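As a rough illustration of the transformation and construction steps above, here is a minimal sketch using only the Python standard library. The column names (`income`, `household_size`) and their values are hypothetical, invented for this example.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy columns (not taken from any real dataset).
income = [20_000, 35_000, 50_000, 1_000_000]
household_size = [1, 2, 4, 5]

# Feature transformation: a log transform compresses the skewed income scale,
# and standardization then puts it on a zero-mean, unit-variance scale.
log_income = [math.log1p(v) for v in income]
income_z = standardize(log_income)
size_scaled = min_max_normalize(household_size)

# Feature construction: combine two existing features into a new one.
income_per_person = [i / h for i, h in zip(income, household_size)]
```

Neither helper handles a constant column (zero spread), which a real pipeline would need to guard against.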

Q2. **What is feature selection, and how does it work? What is the aim of it? What are the various methods of feature selection?**

**Ans**: Feature selection is the process of automatically or manually choosing the features that contribute most to the prediction variable or output of interest. Irrelevant features can reduce a model's accuracy, because the model may fit patterns in them that do not generalize.

There are **three types** of feature selection:

1. Wrapper methods (forward, backward, and stepwise selection)
2. Filter methods (ANOVA, Pearson correlation, variance thresholding)
3. Embedded methods (Lasso, Ridge, Decision Tree).
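To make the filter idea concrete, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The feature names and numbers are hypothetical, and this is only one of many possible filter scores.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_by_correlation(features, target):
    """Return feature names sorted by |Pearson r| with the target, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: "size" tracks the target closely, "noise" does not.
features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

ranking = rank_by_correlation(features, target)
```

No model is trained here, which is what makes filter methods cheap. The function does not handle constant columns, whose correlation is undefined.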

Q3. **Describe the filter and wrapper approaches to feature selection. State the pros and cons of each approach.**

**Ans**: The main differences between the filter and wrapper methods for feature selection are:

1. Filter methods score features with a statistical measure of their relationship to the dependent variable (for example, correlation), while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
2. Filter methods are much faster than wrapper methods because they do not involve training models. Wrapper methods are computationally expensive because a model is trained for every candidate subset.
3. Filter methods use statistical tests to evaluate features, while wrapper methods typically use cross-validation.
4. Filter methods may fail to find the best subset of features because they ignore the model that will be used. Wrapper methods usually find a better subset for that model, but greedy searches do not guarantee the optimal one.
5. Using the subset of features from wrapper methods makes the model more prone to overfitting than using the subset of features from filter methods.

Q4. **Please Answer the following Questions :** <br>
- Describe the overall feature selection process. <br>
- Explain the key underlying principle of feature extraction using an example. What are the most widely used feature extraction algorithms?

**Ans**: 
Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. There are three types of feature selection:

**Wrapper methods** (forward, backward, and stepwise selection): In wrapper methods, a model is trained on a subset of features. Based on the results of that model, features are added to or removed from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.

**Filter methods** (ANOVA, Pearson correlation, variance thresholding): Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in statistical tests of their relationship with the outcome variable. "Correlation" is used loosely here: the appropriate statistic depends on the data types of the feature and the outcome.

**Embedded methods** (Lasso, Ridge, Decision Tree): Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection mechanisms. Some of the most popular examples are Lasso and Ridge regression, which have built-in penalty terms to reduce overfitting. Lasso regression performs L1 regularization, which adds a penalty proportional to the absolute value of the coefficients and can shrink some coefficients exactly to zero, effectively removing those features. Ridge regression performs L2 regularization, which adds a penalty proportional to the square of the coefficients; it shrinks coefficients toward zero but does not usually eliminate them.
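The two penalties can be written out as follows. The notation is an assumption introduced here, not taken from the text above: $y$ is the target vector, $X$ the feature matrix, $\beta$ the coefficient vector with $p$ entries, and $\lambda \ge 0$ the regularization strength.

$$
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

$$
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

In both cases a larger $\lambda$ means stronger shrinkage of the coefficients.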

Q5. **Describe the feature engineering process in the context of a text categorization problem.**

In the context of a text categorization problem, feature engineering involves designing and creating features from the text data that can be used to predict the category of the text. Some common steps in the feature engineering process for a text categorization problem are:

**Data preprocessing**: This involves cleaning and preparing the raw text data for analysis. This may include tasks such as lowercasing, stemming, and removing punctuation and stop words.

**Feature transformation**: This may involve applying techniques such as term frequency-inverse document frequency (TF-IDF) to the text data in order to weight the importance of each term in the document relative to the entire corpus.

**Feature construction**: This may involve creating new features from the text data by combining, aggregating, or summarizing existing features in a meaningful way. For example, a new feature could be created by taking the mean of the TF-IDF values of a group of terms, or by combining the presence or absence of certain terms using a binary operation.

**Feature selection**: This may involve selecting a subset of relevant features from the dataset to use in the model. The goal of feature selection is to reduce the dimensionality of the dataset and remove irrelevant or redundant features that do not contribute to the prediction task.

**Feature extraction**: This may involve using techniques such as latent semantic analysis or singular value decomposition to extract important features from the text data. These techniques can help to identify the underlying structure of the data and extract the most relevant features for the prediction task.

Overall, the goal of feature engineering in a text categorization problem is to design and create relevant and informative features from the text data that can be used to predict the category of the text.
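The preprocessing and TF-IDF steps described above can be sketched in a few lines. This is a minimal version assuming whitespace tokenization, term frequency as count divided by document length, and an unsmoothed `ln(N / df)` inverse document frequency; libraries often use smoothed variants. The three documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    # Preprocessing: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus.
docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
```

A term such as "the" that occurs in every document gets weight 0, while a term unique to one document ("sat") outweighs one shared by two documents ("cat").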

Q6. **What makes cosine similarity a good metric for text categorization? A document-term matrix has two rows with values of (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the cosine similarity.**

**Ans**: Cosine similarity measures how similar two documents are irrespective of their size. This is an advantage because two similar documents may be far apart in Euclidean distance (for example, when one is much longer than the other) yet still point in nearly the same direction.

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors' lengths (or magnitudes).

The formula for calculating the cosine similarity is: **`cos(x, y) = (x . y) / (||x|| * ||y||)`**

`For the vectors in the question: x . y = 23, ||x|| = √40, ||y|| = √29, so cos(x, y) = 23 / (√40 * √29) ≈ 0.675`
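The calculation can be checked with a short sketch that uses only the standard library. The function name `cosine_similarity` is just a label chosen for this example.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [2, 3, 2, 0, 2, 3, 3, 0, 1]
y = [2, 1, 0, 0, 3, 2, 1, 3, 1]

similarity = cosine_similarity(x, y)
print(round(similarity, 3))  # 0.675

# Scaling a vector (a "longer document" with the same word proportions)
# leaves the cosine similarity unchanged.
scaled = cosine_similarity(x, [3 * v for v in y])
```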

Q7. **Explain the following:**
1. What is the formula for calculating Hamming distance? Calculate the Hamming distance between 10001011 and 11001111.
2. Compare the Jaccard index and simple matching coefficient (SMC) of two features with values (1,1,0,0,1,0,1,1) and (1,1,0,0, 0,1,1,1), respectively (1,0,0,1,1,0,0,1)

**Answer**:
1. The Hamming distance between two equal-length bit strings is the number of positions at which they differ, that is, the number of bits that must be flipped to turn one into the other. For example, 01101010 and 11011011 differ in four places, so `d(01101010, 11011011) = 4`. For the question, 10001011 and 11001111 differ in two positions (the second and sixth bits), so the Hamming distance is 2.

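2. For binary features, Jaccard index = f11 / (f11 + f10 + f01) and SMC = (f11 + f00) / (f11 + f10 + f01 + f00), where f11 counts the positions where both values are 1, f00 the positions where both are 0, and f10 and f01 the positions where they differ. For (1,1,0,0,1,0,1,1) and (1,1,0,0,0,1,1,1): f11 = 4, f00 = 2, f10 = 1, and f01 = 1, so Jaccard index = 4/6 ≈ 0.67 (67%) and SMC = 6/8 = 0.75 (75%). The SMC is higher because it also counts the 0-0 matches.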
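Both parts of this question can be verified with a small sketch. It uses the standard binary-vector definitions of the Jaccard index and the simple matching coefficient, and it compares only the first two vectors given in the question.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

def jaccard_index(u, v):
    """1-1 matches divided by positions where at least one value is 1."""
    f11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return f11 / either

def simple_matching_coefficient(u, v):
    """All matching positions (1-1 and 0-0) divided by the total number of positions."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

distance = hamming_distance("10001011", "11001111")

u = [1, 1, 0, 0, 1, 0, 1, 1]
v = [1, 1, 0, 0, 0, 1, 1, 1]
jaccard = jaccard_index(u, v)             # 4 / 6
smc = simple_matching_coefficient(u, v)   # 6 / 8
```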

Q8. **State what is meant by "high-dimensional data set"? Could you offer a few real-life examples? What are the difficulties in using machine learning techniques on a data set with many dimensions? What can be done about it?**

**Ans**:

_High-dimensional_ data is data in which the number of variables (features) p is larger than the number of samples n, that is, p > n.

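Examples include tomographic imaging data, ECG data, MEG data, and microarray gene expression data.

With many dimensions and few samples, models overfit easily, distances between points become less informative, and computation becomes more expensive (the "curse of dimensionality"). Common remedies are feature selection and feature extraction (for example, PCA), which reduce the number of dimensions before modeling.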

Q9. **Make a few quick notes on:**
1. PCA (an acronym for principal component analysis)
2. Use of vectors
3. Embedded technique

**Ans**:
**Principal component analysis (PCA)** is a technique that derives a smaller number of uncorrelated variables, known as principal components, from a larger set of variables. The technique is widely used to emphasize variation and capture strong patterns in a data set.
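A minimal PCA sketch, assuming NumPy is available, is shown below. It follows the textbook route (center the data, eigendecompose the covariance matrix, project onto the leading eigenvectors); the synthetic two-column data set is invented for illustration, and production libraries typically use an SVD-based implementation instead.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the projected data and the fraction of total variance
    explained by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio

# Hypothetical data: the second column is almost a multiple of the first,
# so a single component should capture nearly all of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])

scores, ratio = pca(X, n_components=1)
```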

**Vectors** can be used to represent physical quantities. Most commonly in physics, vectors are used to represent displacement, velocity, and acceleration. Vectors combine magnitude and direction, and are drawn as arrows. In machine learning, a vector is an ordered list of numbers that represents a data point, with one component per feature.

In the context of machine learning, an **embedding** is a learned, low-dimensional, continuous vector representation of discrete variables, into which high-dimensional inputs (for example, one-hot vectors) can be translated. Embeddings generally make ML models more efficient and easier to work with, and can be reused in other models as well.

Q10. **Make a comparison between:**
1. Sequential backward exclusion vs. sequential forward selection
2. Feature selection methods: filter vs. wrapper
3. SMC vs. Jaccard coefficient

1. Sequential backward exclusion and sequential forward selection are two methods that can be used for feature selection, which is the process of selecting a subset of relevant features from a dataset to use in a machine learning model.

Sequential backward exclusion (also known as backward selection) is a method of feature selection that starts with all of the features in the dataset and iteratively removes the least important features until the desired number of features is reached. The process of removing features is guided by a performance metric, such as accuracy or AUC, which is used to evaluate the model after each feature is removed, ideally on validation data rather than on the training data alone. The features that have the greatest impact on the performance of the model are kept, while the least important features are removed.

Sequential forward selection (also known as forward selection) is a method of feature selection that starts with an empty set of features and iteratively adds the most important features to the set until the desired number of features is reached. The process of adding features is also guided by a performance metric, which is used to evaluate the model after each feature is added, ideally on validation data rather than on the training data alone. The features that have the greatest impact on the performance of the model are added to the set, while the least important features are ignored.

In summary, sequential backward exclusion starts with all of the features in the dataset and iteratively removes the least important ones, while sequential forward selection starts with an empty set of features and iteratively adds the most important ones. Both methods aim to select a subset of relevant features that can improve the performance of the machine learning model.
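The two greedy searches can be sketched as follows. The scoring function here is a hypothetical stand-in (a fixed usefulness value per feature); in practice `score_fn` would train a model on the given subset and return a validation or cross-validation score. The feature names are also invented.

```python
def forward_select(candidates, score_fn, k):
    """Sequential forward selection: start empty, add the best feature each round."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the one that scores best when added.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_eliminate(candidates, score_fn, k):
    """Sequential backward exclusion: start with everything, drop one feature each round."""
    selected = list(candidates)
    while len(selected) > k:
        # Drop the feature whose removal leaves the best-scoring subset.
        to_drop = max(selected, key=lambda f: score_fn([g for g in selected if g != f]))
        selected.remove(to_drop)
    return selected

# Hypothetical per-feature usefulness; a real score would come from a trained model.
USEFULNESS = {"age": 0.30, "income": 0.25, "zip_code": 0.02, "shoe_size": 0.01}

def toy_score(subset):
    return sum(USEFULNESS[f] for f in subset)

forward = forward_select(USEFULNESS, toy_score, k=2)
backward = backward_eliminate(USEFULNESS, toy_score, k=2)
```

With this additive toy score both searches arrive at the same two features. With a real model, where features interact, the two directions can end on different subsets.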

2. **The main differences between the filter and wrapper methods for feature selection are:**

- Filter methods score features with a statistical measure of their relationship to the dependent variable (for example, correlation), while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
- Filter methods are much faster than wrapper methods because they do not involve training models. Wrapper methods are computationally expensive because a model is trained for every candidate subset.
- Filter methods use statistical tests to evaluate features, while wrapper methods typically use cross-validation.
- Filter methods may fail to find the best subset of features because they ignore the model that will be used. Wrapper methods usually find a better subset for that model, but greedy searches do not guarantee the optimal one.
- Using the subset of features from wrapper methods makes the model more prone to overfitting than using the subset of features from filter methods.

3. **SMC vs. Jaccard coefficient**

SMC (short for "simple matching coefficient") and the Jaccard coefficient are two measures of the similarity between two sets, or equivalently between two binary vectors that record which attributes each set contains.

The simple matching coefficient (SMC) counts the attributes on which the two objects agree, including attributes that are absent from both, and divides that count by the total number of attributes. The SMC can be calculated using the following formula:

`SMC(A,B) = (f11 + f00) / (f11 + f10 + f01 + f00)`

where A and B are the two sets, f11 is the number of attributes present in both A and B, f00 is the number absent from both, and f10 and f01 are the numbers present in only A or only B.

The Jaccard coefficient (also known as the Jaccard index) counts the number of elements that are shared by both sets and divides it by the total number of elements in either set. The Jaccard coefficient can be calculated using the following formula:

`Jaccard(A,B) = |A ∩ B| / |A ∪ B| = f11 / (f11 + f10 + f01)`

where ∩ represents the intersection of the sets (i.e., the elements that are common to both sets), and ∪ represents the union of the sets (i.e., the elements that are in either set).

Both the SMC and the Jaccard coefficient range from 0 to 1, and a value of 1 indicates that the sets are identical. A Jaccard coefficient of 0 indicates that the sets have no elements in common, whereas an SMC of 0 requires the two objects to disagree on every attribute.

In summary, both measures divide a count of agreements by a count of the attributes considered. The main difference is how they treat attributes that are absent from both sets: the Jaccard coefficient ignores them, while the SMC counts them as matches in both the numerator and the denominator. This makes the Jaccard coefficient better suited to sparse data, where shared absences carry little information.