<a href="https://colab.research.google.com/github/Riturajkumari/PCA/blob/main/PCA_Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Q1. What is the curse of dimensionality reduction and why is it important in machine learning?**

- The curse of dimensionality refers to the difficulties a machine learning algorithm faces when working with high-dimensional data, difficulties that do not arise in lower dimensions.

- High-dimensional data is often dominated by a rather small number of features. If we can find a small set of superfeatures that represents the information nearly as well as the original dataset, we can mitigate the curse of dimensionality. Reducing the number of dimensions has several benefits:

    - Avoids overfitting - the fewer assumptions a model makes, the simpler it will be.
    - Easier computation - the fewer the dimensions, the faster the model trains.
    - Improved model performance - removing redundant features and noise leaves less misleading data, which improves model accuracy.
    - Lower storage cost - lower-dimensional data requires less storage space.
    - Wider choice of algorithms - lower-dimensional data can be used with algorithms that are unsuitable for higher dimensions.

- A large number of input dimensions can slow a model down during training. Applying Principal Component Analysis (PCA) to the data before fitting can speed up the ML algorithm.

PCA projects the data onto the directions of greatest variance. These directions, each a linear combination of the original features, are the principal components.
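To make the idea of a principal component concrete, here is a minimal sketch that computes the components by hand with NumPy and compares them with scikit-learn. It assumes the iris data bundled with scikit-learn (`load_iris`) rather than the seaborn copy, so that it runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris measurements (mean 0, unit variance).
X_std = StandardScaler().fit_transform(load_iris().data)

# Eigen-decomposition of the covariance matrix of the features.
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort descending so the
# direction of greatest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each column of `eigenvectors` is one principal component: a set of
# weights over the original features, not a single original feature.
projected = X_std @ eigenvectors[:, :2]

# Compare with scikit-learn's PCA (components may differ in sign).
pca = PCA(n_components=2).fit(X_std)
print("variance along PC1, PC2 (manual): ", eigenvalues[:2])
print("variance along PC1, PC2 (sklearn):", pca.explained_variance_)
```

The two printed lines should agree, which shows that the variance captured by each component is an eigenvalue of the covariance matrix.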

- Several techniques can be employed for dimensionality reduction depending on the problem and the data. These techniques are divided into two broad categories:
    - Feature Selection: Choosing the most important features from the data.
    - Feature Extraction: Combining features to create new superfeatures.

- The PCA example on the iris dataset follows these steps:
    - Loading the dataset
    - Standardizing the data onto a unit scale
    - Projecting the data onto two dimensions with PCA
    - Concatenating the principal components with the target variable
    - Visualizing the 2D projection

In [1]:
# Step 1: load the iris dataset (seaborn downloads and caches it on first use)
import seaborn as sns

df = sns.load_dataset('iris')

In [2]:
df.head(5)

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:

# Split the data into features (X) and target (y)
from sklearn.model_selection import train_test_split

X = df.drop("species", axis=1)
y = df['species']

test_size = 0.30  # 70:30 split between training and test sets
seed = 7  # random seed for repeatability of the split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

In [4]:
# Standardizing the data onto a unit scale (mean 0, variance 1)
from sklearn.preprocessing import StandardScaler

features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# PCA is sensitive to feature scale, so standardize all four features first
x = StandardScaler().fit_transform(X[features])

In [5]:
# Step 3 – Projecting PCA to 2D
import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=2)

components = pca.fit_transform(x)

data = pd.DataFrame(data=components, columns=['PC1', 'PC2'])

data.head()

|   | PC1 | PC2 |
|---|---|---|
| 0 | -2.264703 | 0.480027 |
| 1 | -2.080961 | -0.674134 |
| 2 | -2.364229 | -0.341908 |
| 3 | -2.299384 | -0.597395 |
| 4 | -2.389842 | 0.646835 |


- The data is now projected onto two principal components, PC1 and PC2. These are the two directions along which the standardized data varies the most.

In [6]:
# Step 4 – Concatenating the Principal Components with the target variable
final_data = pd.concat([data, df[['species']]], axis = 1)

final_data.head()

|   | PC1 | PC2 | species |
|---|---|---|---|
| 0 | -2.264703 | 0.480027 | setosa |
| 1 | -2.080961 | -0.674134 | setosa |
| 2 | -2.364229 | -0.341908 | setosa |
| 3 | -2.299384 | -0.597395 | setosa |
| 4 | -2.389842 | 0.646835 | setosa |
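The last step in the list, visualizing the 2D projection, could look roughly like the sketch below. It is not the notebook's original cell: it assumes matplotlib is available and rebuilds the PC1/PC2 table from scikit-learn's bundled iris data (`load_iris`) so that it runs on its own, and `plot_data` is an illustrative name.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the PC1/PC2 table so the sketch runs on its own.
iris = load_iris()
scaled = StandardScaler().fit_transform(iris.data)
components = PCA(n_components=2).fit_transform(scaled)
plot_data = pd.DataFrame(components, columns=["PC1", "PC2"])
plot_data["species"] = iris.target_names[iris.target]

# One scatter series per species so each gets its own color and legend entry.
fig, ax = plt.subplots(figsize=(6, 5))
for species, group in plot_data.groupby("species"):
    ax.scatter(group["PC1"], group["PC2"], label=species, alpha=0.8)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("2D PCA projection of the iris dataset")
ax.legend(title="species")
```

In a notebook the figure is rendered when the cell finishes; in a plain script, a call to `plt.show()` would be needed at the end.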


**Q2. How does the curse of dimensionality impact the performance of machine learning algorithms?**

- The curse of dimensionality is a problem that arises when working with high-dimensional data, meaning data with many features or columns.
- Algorithms become harder to design in high dimensions and often have a running time that is exponential in the number of dimensions.
- For a fixed amount of training data, model error tends to increase as the number of features grows, because the data becomes too sparse to cover the feature space.

**The difficulty of analyzing high-dimensional data results from the conjunction of two effects:**

- High-dimensional spaces have geometrical properties that are counter-intuitive and far from the properties observed in two- or three-dimensional spaces.
- Data analysis tools are most often designed with intuitive, low-dimensional properties and examples in mind, and they are usually best illustrated in 2- or 3-dimensional spaces.
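One such counter-intuitive property is that distances tend to concentrate: in high dimensions, the nearest and farthest neighbors of a point end up almost equally far away. The sketch below illustrates this with uniformly random points; the function name, the sample size, and the seed are arbitrary choices for illustration.

```python
import numpy as np

def nearest_to_farthest_ratio(dim, n_points=500, seed=0):
    """Ratio of the smallest to the largest distance from a random query point."""
    rng = np.random.default_rng(seed)
    # Uniform random points in the unit hypercube [0, 1]^dim.
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    # Euclidean distance from the query point to every sampled point.
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

for dim in (2, 10, 100, 1000):
    ratio = nearest_to_farthest_ratio(dim)
    print(f"dim={dim:>4}  nearest/farthest distance ratio = {ratio:.3f}")
```

A ratio close to 1 means the notion of a "nearest" neighbor carries little information, which is one reason distance-based methods degrade in high dimensions.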

**Q3. What are some of the consequences of the curse of dimensionality in machine learning, and how do
they impact model performance?**

- The **curse of dimensionality in machine learning** is the observation that as the number of dimensions or features increases, the amount of data needed for the model to generalize accurately increases exponentially. The increase in dimensions makes the data sparse, which makes generalization harder, so more training data is needed.

- Some of the consequences of the curse of dimensionality are:
    - Increased computational complexity: As the number of dimensions increases, the amount of data required to fill the space grows exponentially, and so does the cost of processing it.
    - Increased sparsity: High-dimensional data is often sparse, meaning that most of the data points are far apart from each other.
    - Overfitting: With many features and comparatively few samples, a model can become too complex and fit the training data too closely. This leads to poor generalization performance on new data.
    - Underfitting: Underfitting occurs when a model is too simple and does not capture all the relevant features in the data. This can also lead to poor generalization performance.
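The exponential growth and the resulting sparsity can be illustrated with a small counting experiment. The sketch below assumes each axis is cut into 10 equal bins, so a `dim`-dimensional space has `10 ** dim` cells, and measures how many of those cells a fixed sample of 1000 random points manages to reach. The function name and all parameters are illustrative.

```python
import numpy as np

def occupied_fraction(n_samples, dim, bins_per_axis=10, seed=0):
    """Fraction of grid cells that contain at least one random sample."""
    rng = np.random.default_rng(seed)
    samples = rng.random((n_samples, dim))
    # Map each sample to the integer index of the cell it falls in.
    cells = np.floor(samples * bins_per_axis).astype(int)
    n_occupied = len(np.unique(cells, axis=0))
    return n_occupied / bins_per_axis ** dim

for dim in (1, 2, 3, 6):
    total_cells = 10 ** dim
    frac = occupied_fraction(n_samples=1000, dim=dim)
    print(f"dim={dim}: {total_cells:>9,} cells, {frac:.4%} occupied by 1000 samples")
```

With the sample size held fixed, the occupied share shrinks quickly as dimensions are added, which is the sparsity described above.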

**Q4. Can you explain the concept of feature selection and how it can help with dimensionality reduction?**

- Feature selection is a process of selecting a subset of relevant features for use in model construction. It is used to reduce the number of input variables when developing a predictive model. Feature selection techniques are used for several reasons:

    - Simplification of models to make them easier to interpret by researchers/users.
    - Shorter training times.
    - To avoid the curse of dimensionality.
    - Enhanced generalization by reducing overfitting (reducing variance).

- Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction. Feature selection involves selecting a subset of the original features that are most relevant to the problem at hand. Feature extraction involves transforming the data in the high-dimensional space to a space of fewer dimensions.
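As a minimal sketch of feature selection, the example below keeps the two iris features that score highest on a univariate test. It uses scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`); the choice of scoring function and of `k=2` is an assumption made for illustration, and other selection strategies exist.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Score each feature with a one-way ANOVA F-test against the class label
# and keep the two highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print("F-scores:", dict(zip(iris.feature_names, selector.scores_.round(1))))
print("kept features:", kept)
print("shape before/after:", X.shape, X_selected.shape)
```

Unlike PCA, the retained columns are original features, so the reduced dataset stays directly interpretable.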

**Q5. What are some limitations and drawbacks of using dimensionality reduction techniques in machine
learning?**

- Dimensionality reduction reduces the number of variables or features in a dataset while retaining as much of the important information or patterns as possible. In other words, it transforms high-dimensional data into a lower-dimensional space that still preserves the essence of the original data. It is widely used in machine learning, particularly when the number of features is larger than the number of samples, a situation that leads to overfitting, high computational cost, and poor model performance.

- The main drawbacks follow from that definition: some information is always discarded when dimensions are dropped, and extracted features such as principal components are combinations of the original features, which makes them harder to interpret.
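One way to see how much information a reduction discards is to project the data down, map it back to the original space, and measure the difference. The sketch below does this for PCA on scikit-learn's bundled iris data; using mean squared reconstruction error as the measure is an assumption made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

errors = {}
for k in (1, 2, 3, 4):
    pca = PCA(n_components=k)
    reduced = pca.fit_transform(X_std)         # shape (150, k)
    restored = pca.inverse_transform(reduced)  # back to shape (150, 4)
    # Mean squared difference between the original and restored data.
    errors[k] = float(np.mean((X_std - restored) ** 2))
    print(f"k={k}: reconstruction MSE = {errors[k]:.4f}")
```

The error reaches zero (up to floating-point precision) only when all four components are kept; every smaller `k` trades some information for fewer dimensions.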

**Q6. How does the curse of dimensionality relate to overfitting and underfitting in machine learning?**

- Overfitting and underfitting are the two main problems that degrade the performance of machine learning models. The curse of dimensionality makes overfitting more likely: with many features and comparatively few samples, a model has more freedom to fit noise.
 - Overfitting occurs when a model tries to fit every data point in the training set, including noise and inaccurate values, which reduces its accuracy on new data. An overfitted model has low bias and high variance.

 - Avoiding overfitting: Both overfitting and underfitting degrade model performance, but overfitting is the more common problem. Some ways to reduce it are:

   - Cross-Validation
   - Training with more data
   - Removing features
   - Early stopping the training
   - Regularization
   - Ensembling

- Underfitting: Underfitting occurs when a model is not able to capture the underlying trend of the data. It can happen when training is stopped too early in an attempt to avoid overfitting, so that the model does not learn enough from the training data. As a result, it may fail to find the best fit of the dominant trend in the data. An underfitted model has high bias and low variance.
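To illustrate why removing features appears among the remedies for overfitting, the sketch below adds columns of pure noise to the iris data and tracks cross-validated accuracy. The k-nearest-neighbors classifier, the noise counts, and the seed are assumptions chosen for illustration; other models may be more or less sensitive to irrelevant dimensions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(7)

def cv_accuracy(features):
    """Mean 5-fold cross-validated accuracy of a scaled k-NN classifier."""
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    return cross_val_score(model, features, y, cv=5).mean()

results = {}
for n_noise in (0, 20, 200):
    # Append columns of pure Gaussian noise that carry no class information.
    noise = rng.normal(size=(X.shape[0], n_noise))
    results[n_noise] = cv_accuracy(np.hstack([X, noise]))
    print(f"{n_noise:>3} noise features: CV accuracy = {results[n_noise]:.3f}")
```

The number of samples stays the same throughout; only the number of dimensions grows, so any drop in accuracy comes from the added dimensions alone.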

**Q7. How can one determine the optimal number of dimensions to reduce data to when using
dimensionality reduction techniques?**

- The optimal number of dimensions depends on the dataset and the problem you are trying to solve. There is no one-size-fits-all answer, but some methods can help you choose.
- One such method is PCA (Principal Component Analysis), a popular technique for dimensionality reduction. It works by finding the principal components of the data and projecting the data onto them. The principal components are the directions in which the data varies the most, so the share of variance each one explains indicates how many are worth keeping.
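A common heuristic built on PCA is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below applies it to scikit-learn's bundled iris data; the 95% threshold is an assumption for illustration, not a universal rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components and accumulate the explained variance.
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} component(s): {share:.1%} of variance explained")

# Smallest number of components reaching the (assumed) 95% threshold.
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold) + 1)
print("components to keep:", n_keep)

# scikit-learn can apply the same rule directly: a float n_components
# between 0 and 1 is treated as the target share of explained variance.
pca_auto = PCA(n_components=threshold).fit(X_std)
print("chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```

When the reduced data feeds a downstream model, the number of components can also be treated as a hyperparameter and tuned with cross-validation.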