## Q1. What is the curse of dimensionality reduction and why is it important in machine learning?

Ans-The curse of dimensionality refers to a set of problems that arise when working with high-dimensional data. The dimension of a dataset is the number of attributes (features) it contains. A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high-dimensional data. Some of the difficulties show up when analyzing or visualizing the data to identify patterns, and some show up while training machine learning models. The difficulties related to training machine learning models on high-dimensional data are referred to as the ‘Curse of Dimensionality’.

In machine learning, even a marginal increase in dimensionality requires a large increase in the volume of data in order to maintain the same level of performance. The curse of dimensionality is a by-product of phenomena that appear only with high-dimensional data.

## Q2. How does the curse of dimensionality impact the performance of machine learning algorithms?

Ans-As the number of dimensions or features increases, the amount of data needed to generalize the machine learning model accurately increases exponentially. The increase in dimensions makes the data sparse, and it increases the difficulty of generalizing the model. More training data is needed to generalize that model better.
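A rough way to see the exponential growth is to count grid cells: if every feature is cut into a fixed number of bins, placing at least one sample in each cell needs bins^d samples. The sketch below assumes 10 bins per feature, and the function name `cells_to_cover` is hypothetical, chosen only for illustration.

```python
def cells_to_cover(n_features, bins_per_feature=10):
    """Number of grid cells when every feature is cut into equal bins."""
    return bins_per_feature ** n_features

# Each extra feature multiplies the number of cells by bins_per_feature
for d in (1, 2, 3, 10):
    print(d, cells_to_cover(d))
```

With 3 features the grid already has 1000 cells, and with 10 features it has 10 billion, so a dataset of fixed size leaves almost all cells empty.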

In higher dimensions, the distances between points become nearly equal. Sampling also gets harder as the dimensions grow, because a fixed number of random samples covers an ever smaller part of the space.

It also becomes harder to collect enough observations when there are many features. With this many dimensions, every observation in the dataset appears almost equidistant from every other observation. Clustering typically uses a distance such as Euclidean distance to measure the similarity between observations, so meaningful clusters cannot be formed if all the distances are nearly equal.
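The equidistance effect described above can be checked with a small experiment. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube; `distance_contrast` is a hypothetical helper that reports how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def distance_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over distances from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the number of dimensions grows
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(200, d), 3))
```

A contrast close to zero means the nearest and farthest points are almost the same distance away, which is the situation in which distance-based clustering stops being useful.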

## Q3. What are some of the consequences of the curse of dimensionality in machine learning, and how do they impact model performance?

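Ans-Effect of the Curse of Dimensionality on Distance Functions:
In high dimensions, the distance from a point to its nearest neighbor becomes almost the same as the distance to its farthest neighbor, so distances stop being informative. Therefore, machine learning algorithms that rely on a distance measure, including KNN (k-Nearest Neighbors), tend to perform poorly when the number of dimensions in the data is very high.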

## Q4. Can you explain the concept of feature selection and how it can help with dimensionality reduction?

Ans-There are two components of dimensionality reduction:

- Feature selection: Here we try to find a subset of the original variables, or features, that can be used to model the problem. It is usually done in one of three ways:
  - Filter
  - Wrapper
  - Embedded
- Feature extraction: This maps the data from a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
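As an illustration of the filter approach to feature selection, the sketch below scores each feature on its own, without training any model, and drops features whose variance is not above a threshold. Variance is only one possible filter criterion, and the helper name `variance_filter`, the toy rows, and the feature names are all hypothetical.

```python
from statistics import pvariance

def variance_filter(rows, feature_names, threshold=0.0):
    """Keep the features whose population variance exceeds the threshold."""
    columns = list(zip(*rows))  # transpose rows into per-feature columns
    return [name for name, col in zip(feature_names, columns)
            if pvariance(col) > threshold]

# Toy data: the second feature is constant and carries no information
rows = [
    (1.0, 5.0, 0.2),
    (2.0, 5.0, 0.1),
    (3.0, 5.0, 0.4),
    (4.0, 5.0, 0.3),
]
selected = variance_filter(rows, ["age", "constant", "score"])
print(selected)  # ['age', 'score']
```

Wrapper and embedded methods differ in that they involve a model: wrappers search over feature subsets by repeatedly training it, while embedded methods select features as part of training.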
The various methods used for dimensionality reduction include:

- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be linear or non-linear, depending on the method used. The best-known linear method is Principal Component Analysis (PCA).

Dimensionality reduction is the process of reducing the number of features in a dataset while retaining as much information as possible.
This can be done to reduce the complexity of a model, improve the performance of a learning algorithm, or make it easier to visualize the data.
Techniques for dimensionality reduction include: principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).
Each technique projects the data onto a lower-dimensional space while preserving important information.
Dimensionality reduction is usually performed during the pre-processing stage, before building a model, to improve its performance.
It is important to note that dimensionality reduction can also discard useful information, so care must be taken when applying these techniques.
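To make the idea of projecting onto a lower-dimensional space concrete, here is a minimal PCA sketch built on NumPy's singular value decomposition. It is an illustration under stated assumptions, not a replacement for a library implementation: the function name `pca_project` and the synthetic data are hypothetical.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)  # share of variance per component
    return X_centered @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Three features that are almost copies of one underlying signal
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))
Z, ratio = pca_project(X, n_components=1)
print(Z.shape, ratio)
```

Because the three features are nearly redundant, a single component keeps almost all of the variance. The explained-variance share also shows how much information is discarded by the components that are dropped.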

## Q5. What are some limitations and drawbacks of using dimensionality reduction techniques in machine learning?

Ans-Dimensionality reduction brings many advantages to your machine learning data, including:

- Fewer features mean less complexity.
- You need less storage space because you have less data.
- Fewer features require less computation time.
- Model accuracy can improve because there is less misleading data.
- Algorithms train faster thanks to the smaller amount of data.
- Reducing the data set’s feature dimensions makes the data easier to visualize.
- It removes noise and redundant features.

Benefits Of Dimensionality Reduction
Dimensionality reduction is helpful for AI engineers and data professionals who work with enormous datasets, visualise data, and analyse complicated data.

1. It aids in data compression, so less storage space is required.
2. It speeds up computation.
3. It also helps remove extraneous features.


Disadvantages Of Dimensionality Reduction:

1. Some information is lost during dimensionality reduction, which can hurt the performance of algorithms trained afterwards.
2. It may need a lot of processing power.
3. The transformed features can be hard to interpret.
4. As a result, the link back to the original independent variables becomes harder to understand.

## Q6. How does the curse of dimensionality relate to overfitting and underfitting in machine learning?

Ans-Overfitting and Underfitting
Let 'd' be the number of dimensions (features). Overfitting tends to grow with 'd': as the dimensionality increases while the number of training samples stays fixed, the chances of overfitting also increase. Conversely, removing too many dimensions can discard useful information and lead to underfitting.

Two common ways to reduce this risk are:

a) Model-dependent approach: Whenever we have a large number of features, we can perform forward feature selection to determine the most relevant features for the prediction.

b) Unlike the above solution, which is classification-oriented, we can also apply dimensionality reduction techniques such as PCA and t-SNE. These do not use the class labels; instead of picking the most relevant original features, they build a smaller set of new features from them.

So whenever you work with a new dataset that has a large number of features, keep in mind that you can reduce it with techniques such as PCA, t-SNE, or forward selection so that your model is less affected by the curse of dimensionality.
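Forward feature selection, mentioned above, can be sketched as a greedy loop that adds one feature at a time. This is a simplified illustration with several assumptions: it uses an ordinary least-squares fit as the model, scores candidates by training error rather than a held-out or cross-validated score, and the function name `forward_selection` and the synthetic data are hypothetical.

```python
import numpy as np

def forward_selection(X, y, n_select):
    """Greedily add the feature that most reduces least-squares error."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        best_feature, best_error = None, np.inf
        for j in remaining:
            cols = selected + [j]
            # Design matrix with an intercept column plus the candidate features
            A = np.column_stack([np.ones(len(y)), X[:, cols]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            error = np.sum((y - A @ coef) ** 2)
            if error < best_error:
                best_feature, best_error = j, error
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only features 1 and 4 drive the target; the rest are noise
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=300)
print(forward_selection(X, y, n_select=2))
```

In practice the stopping point and the scoring would normally rely on validation data, since training error keeps decreasing as features are added.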

## Q7. How can one determine the optimal number of dimensions to reduce data to when using dimensionality reduction techniques?

Ans-Common Dimensionality Reduction Techniques
Dimensionality reduction can be done in two different ways:

- By keeping only the most relevant variables from the original dataset (this technique is called feature selection)
- By finding a smaller set of new variables, each being a combination of the input variables, containing basically the same information as the input variables (this technique is called feature extraction)

We will now look at various dimensionality reduction techniques and how to implement each of them in Python.

Missing Value Ratio
Suppose you’re given a dataset. What would be your first step? You would naturally want to explore the data before building a model. While exploring the data, you find that your dataset has some missing values. Now what? You will try to find out the reason for these missing values and then either impute them or drop the affected variables entirely (using appropriate methods).

What if we have too many missing values (say more than 50%)? Should we impute the missing values or drop the variable? I would prefer to drop the variable since it will not have much information. However, this isn’t set in stone. We can set a threshold value and if the percentage of missing values in any variable is more than that threshold, we will drop the variable.

# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# read the data
train=pd.read_csv("Train.csv")

# percentage of missing values in each variable
a = train.isnull().sum() / len(train) * 100

# saving column names in a variable
variables = train.columns

# keep only the variables whose missing-value percentage is within the threshold
variable = []
for i in range(len(variables)):
    if a.iloc[i] <= 20:   # setting the threshold as 20%
        variable.append(variables[i])