# Visualizing the PCA Transformation

**PCA follows the fit/transform pattern**

In [None]:
"""

PCA (Principal Component Analysis) in scikit-learn works in two steps:
1) Fit: Learns how to shift and rotate the data to find the best lower-dimensional representation.
2) Transform: Actually applies this transformation to the data, reducing its dimensions.

Example
Consider a dataset of students' grades in Math, Science, and English.
We want to reduce the number of features while keeping most of the important information.

Step 1: Fit
----------------
PCA analyzes the data and figures out how to rotate it so that the most important patterns appear first.
This step does not change the data—it just learns how to transform it.


Step 2: Transform
------------------
Now, PCA applies the learned transformation to actually reduce the dimensions.
For example, instead of having 3 features (Math, Science, English), it might reduce it to just 2 principal components.


New, unseen data (e.g., a new student’s grades) can also be transformed using the same learned transformation!



"""
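The two steps described above can be sketched as follows. This is a minimal illustration, not part of the original exercises: the grade values and the names `grades` and `new_student` are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical grades; columns are Math, Science, English
grades = np.array([
    [90, 85, 70],
    [60, 65, 80],
    [75, 70, 75],
    [85, 90, 65],
    [50, 55, 85],
], dtype=float)

pca = PCA(n_components=2)

# Step 1: fit learns the mean and the principal axes; grades itself is unchanged
pca.fit(grades)

# Step 2: transform applies the learned shift and rotation
reduced = pca.transform(grades)
print(reduced.shape)  # (5, 2)

# A new, unseen student is projected with the same learned transformation
new_student = np.array([[80.0, 75.0, 72.0]])
new_reduced = pca.transform(new_student)
print(new_reduced.shape)  # (1, 2)
```

Because the model is fitted once and then reused, the new student lands in the same coordinate system as the original five.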

**PCA Features**

In [None]:
"""

Step 1: Import PCA and Fit the Data
------------------------------------------
1) First, import PCA from sklearn.decomposition.
2) Then, create a PCA object and fit it to your dataset. This step learns how to reduce the dimensions.
3) Finally, transform the dataset using the trained PCA object. This returns a new dataset with PCA features.

Consider a wine dataset with features like:

 Alcohol content
 Sugar level
 Acidity

If we apply PCA, it creates new features (PCA features) that combine information from the original features.

Step 2: Understanding the Transformed Data
-------------------------------------------------
 The transformed dataset still has the same number of rows (one for each wine sample).
 But the columns now represent PCA features instead of the original features like alcohol or acidity.

"""
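As a rough sketch of what "combining information from the original features" means, the snippet below checks that each PCA feature is a weighted sum of the centered original features. The wine values are randomly generated stand-ins (an assumption for illustration), not a real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical wine samples; columns are alcohol, sugar, acidity
wines = rng.normal(loc=[12.0, 5.0, 3.3], scale=[1.0, 2.0, 0.2], size=(50, 3))

pca = PCA()
pca_features = pca.fit_transform(wines)

# Same number of rows (one per wine); the columns are now PCA features
print(wines.shape, pca_features.shape)  # (50, 3) (50, 3)

# Each PCA feature is a weighted combination of the centered original features;
# the weights are the rows of pca.components_
manual = (wines - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca_features))  # True
```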

**PCA features are not correlated**

In [None]:
"""

1) PCA Removes Correlation Between Features
In many datasets, some features are correlated, meaning they change together.

In a wine dataset
  Alcohol % and Sugar Level might be correlated (higher alcohol means lower sugar).
  Acidity and Sweetness might also be related.

But PCA transforms the data so that the new features (PCA features) are no longer linearly correlated.
This happens because PCA rotates the data so that the new axes line up with the directions in which the data varies the most.

PCA is designed to reduce redundancy in data. Correlated features contain overlapping information, meaning some features are not adding much new insight.
PCA transforms the data so that each new feature (PCA component) captures unique patterns in the dataset.


Suppose we are analyzing students' performance, with "Hours of Study" and "Number of Pages Read" as features.
These two features are highly correlated because studying more often means reading more pages.
PCA would capture most of their shared variation in a single component, which we might call "Study Effort", reducing redundancy.



When Correlation is Important (When NOT to Use PCA)
---------------------------------------------------------
If we need to interpret relationships between features, PCA might not be the best choice.
Some ML models, like Decision Trees or XGBoost, work well with correlated features.
If correlation is meaningful (e.g., stock market relationships, weather patterns), PCA could remove valuable insights.


When PCA is Useful
 Dimensionality Reduction – If you have too many features, PCA helps simplify the dataset.
 Noise Reduction – Dropping the low-variance components can filter out noise.
 Visualization – If data has many dimensions (e.g., 100 features), PCA helps convert it to 2D or 3D for easy visualization.


"""
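The "Hours of Study" and "Number of Pages Read" example can be sketched with synthetic data. The numbers below (20 pages per hour, the noise level, the sample size) are assumptions chosen only to produce two strongly correlated features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: pages read grows with hours studied, plus some noise
hours = rng.uniform(1, 10, size=200)
pages = 20 * hours + rng.normal(0, 10, size=200)
study = np.column_stack([hours, pages])

# The original features are strongly correlated
corr_before = np.corrcoef(hours, pages)[0, 1]
print(corr_before)

# After PCA, the two new features are uncorrelated (up to rounding error)
pca = PCA()
pca_features = pca.fit_transform(study)
corr_after = np.corrcoef(pca_features[:, 0], pca_features[:, 1])[0, 1]
print(corr_after)

# Almost all of the variance ends up in the first PCA feature
print(pca.explained_variance_ratio_)
```

The first PCA feature plays the role of the single "Study Effort" component; the second carries only the small leftover variation.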

In [None]:
"""

Import:
matplotlib.pyplot as plt.
pearsonr from scipy.stats.


Assign column 0 of grains to width and column 1 of grains to length.
Make a scatter plot with width on the x-axis and length on the y-axis.
Use the pearsonr() function to calculate the Pearson correlation of width and length

"""



# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assign the 0th column of grains: width
width = grains[:, 0]

# Assign the 1st column of grains: length
length = grains[:, 1]

# Scatter plot width vs length
plt.scatter(width, length)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation
correlation, pvalue = pearsonr(width, length)

# Display the correlation
print(correlation)

### 0.8604149377143469


In [None]:
### Decorrelating the grain measurements with PCA

"""

You observed in the previous exercise that the width and length measurements of the grain are correlated. Now, you'll use PCA to decorrelate these measurements,
then plot the decorrelated points and measure their Pearson correlation

"""


"""

Import PCA from sklearn.decomposition.
Create an instance of PCA called model.
Use the .fit_transform() method of model to apply the PCA transformation to grains. Assign the result to pca_features.
The subsequent code to extract, plot, and compute the Pearson correlation of the first two columns of pca_features has already been written.

"""


# Import PCA
from sklearn.decomposition import PCA

# Create PCA instance: model
model = PCA()

# Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)

# Assign 0th column of pca_features: xs
xs = pca_features[:,0]

# Assign 1st column of pca_features: ys
ys = pca_features[:,1]

# Scatter plot xs vs ys
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)

# Display the correlation
print(correlation)

## 5.4909917646575975e-17

# Intrinsic Dimension

In [None]:
"""

The intrinsic dimension of a dataset is the number of features required to approximate it. The intrinsic dimension informs dimension reduction,
because it tells us how much a dataset can be compressed.


Consider this dataset with 2 features: latitude and longitude. These two features might track the flight of an airplane, for example.
This dataset is 2-dimensional, yet it turns out that it can be closely approximated using only one feature: the displacement along the flight path.
This dataset is intrinsically one-dimensional.

"""
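A small sketch of the flight example, assuming a straight path with a little measurement noise. The coordinates, slopes, and noise level are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical flight: position along a straight path plus small measurement noise
displacement = np.linspace(0, 10, 100)
latitude = 40.0 + 0.5 * displacement + rng.normal(0, 0.01, size=100)
longitude = -75.0 + 0.8 * displacement + rng.normal(0, 0.01, size=100)
track = np.column_stack([latitude, longitude])

pca = PCA().fit(track)

# Nearly all of the variance lies along the first principal component,
# which is the direction of the flight path
ratios = pca.explained_variance_ratio_
print(ratios)
```

Two features go in, but one component accounts for almost all of the variance, which is what "intrinsically one-dimensional" means here.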

**PCA features are ordered by variance**

In [None]:
"""

1) How Does PCA Find Intrinsic Dimension?
---------------------------------------------

PCA helps find the intrinsic dimension by sorting the PCA features based on variance.

PCA rearranges the dataset so that the new features (PCA components) are ordered from most important to least important.
The first few PCA components will have high variance, meaning they capture most of the data's structure.
The remaining components have low variance, meaning they add little useful information.


Key Idea: The number of PCA features with high variance tells us the intrinsic dimension of the dataset.


2) How to Check Intrinsic Dimension?
--------------------------------------------
Plot the variance of each PCA feature as a bar graph. After fitting, the variances are available in the explained_variance_ attribute of the PCA model.

The first few bars are tall → These are important PCA features.
The last few bars are small → These features have low variance and can be ignored.


"""
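One possible way to turn "count the tall bars" into code is to count the PCA features whose share of the variance exceeds a cutoff. The data below is synthetic (two hidden factors mixed into four features), and the 5% `threshold` is an arbitrary assumption; in practice the cutoff is a judgment call made by looking at the bar graph.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: two hidden factors drive four observed features, plus a little noise
factors = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.5, -0.3, 0.8],
                   [0.2, -1.0, 0.7, 0.4]])
samples = factors @ mixing + rng.normal(scale=0.05, size=(300, 4))

pca = PCA().fit(samples)

# Share of the total variance carried by each PCA feature, largest first
ratios = pca.explained_variance_ratio_
print(ratios)

# Count the features whose share exceeds the (assumed) cutoff
threshold = 0.05
intrinsic_dim = int(np.sum(ratios > threshold))
print(intrinsic_dim)  # 2
```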

In [None]:
### The first principal component

"""

The first principal component of the data is the direction in which the data varies the most. In this exercise, your job is to use PCA to find the first principal component
of the length and width measurements of the grain samples, and represent it as an arrow on the scatter plot.

"""

"""

Make a scatter plot of the grain measurements. This has been done for you.
Create a PCA instance called model.
Fit the model to the grains data.
Extract the coordinates of the mean of the data using the .mean_ attribute of model.
Get the first principal component of model using the .components_[0,:] attribute.
Plot the first principal component as an arrow on the scatter plot, using the plt.arrow() function. You have to specify the first two arguments - mean[0] and mean[1].

"""

# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])

# Create a PCA instance: model
model = PCA()

# Fit model to points
model.fit(grains)

# Get the mean of the grain samples: mean
mean = model.mean_

# Get the first principal component: first_pc
first_pc = model.components_[0,:]

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)

# Keep axes on same scale
plt.axis('equal')
plt.show()

In [None]:
### Variance of the PCA features


"""

Create an instance of StandardScaler called scaler.
Create a PCA instance called pca.
Use the make_pipeline() function to create a pipeline chaining scaler and pca.
Use the .fit() method of pipeline to fit it to the fish samples, samples.
Extract the number of components used using the .n_components_ attribute of pca. Place this inside a range() function and store the result as features.
Use the plt.bar() function to plot the explained variances, with features on the x-axis and pca.explained_variance_ on the y-axis.

"""

# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

# Create scaler: scaler
scaler = StandardScaler()

# Create a PCA instance: pca
pca = PCA()

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)

# Fit the pipeline to 'samples'
pipeline.fit(samples)

# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()

# Dimension reduction with PCA

In [None]:
"""

PCA discards the low-variance PCA features and assumes that the higher-variance PCA features are informative.

"""
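To make "discards the low variance features" concrete, the sketch below compares a full PCA with one limited to two components. The data is synthetic, and `svd_solver='full'` is set explicitly (an assumption made so that both models use the same exact solver).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data with four features, most of whose variance lies in two directions
factors = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -0.3, 0.8],
                   [0.2, -1.0, 0.7, 0.4]])
samples = factors @ mixing + rng.normal(scale=0.05, size=(200, 4))

# Keep every PCA feature
full = PCA(svd_solver='full').fit_transform(samples)

# Keep only the two PCA features with the highest variance
reduced = PCA(n_components=2, svd_solver='full').fit_transform(samples)

print(full.shape, reduced.shape)  # (200, 4) (200, 2)

# The reduced result is the full result with the low-variance columns dropped
print(np.allclose(full[:, :2], reduced))  # True
```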

**TruncatedSVD and csr_matrix**

In [None]:
"""

1) What is a Word Frequency Array?
A word frequency array is a table where:

Each row represents a document (e.g., a book, article, or review).
Each column represents a word from a fixed list of words (vocabulary).
The values in the table show how many times each word appears in each document.

We have 3 documents and a vocabulary of 5 words:

Document   "apple"   "banana"   "cat"   "dog"   "elephant"
Doc 1         2         1         0       0         0
Doc 2         0         3         0       1         0
Doc 3         0         0         2       4         1


Doc 1 mentions "apple" twice and "banana" once.
Doc 2 mentions "banana" three times and "dog" once.
Doc 3 mentions "cat", "dog", and "elephant".

Key point:
Most words do not appear in each document, so most of the values are 0. This makes the matrix sparse.

2) What is a Sparse Matrix? (csr_matrix)
--------------------------------------------
A sparse matrix is a table where most values are zero.

Instead of storing all the values, a sparse matrix only stores the non-zero values to save space.
In Python, we use a special format called csr_matrix (Compressed Sparse Row matrix) to handle sparse data efficiently.



Instead of storing:

[2, 1, 0, 0, 0]
[0, 3, 0, 1, 0]
[0, 0, 2, 4, 1]


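Conceptually, a sparse matrix stores only the non-zero entries and where they are:

(values) → [2, 1, 3, 1, 2, 4, 1]
(positions) → [(0,0), (0,1), (1,1), (1,3), (2,2), (2,3), (2,4)]

csr_matrix encodes the positions compactly, as one column index per value plus a pointer to where each row starts.
This saves memory when dealing with large text datasets.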



3) Why Can’t We Use PCA on a Sparse Matrix?
--------------------------------------------------
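PCA (Principal Component Analysis) centers the data (subtracts the mean of each feature) before finding the principal components, and centering turns most of the zeros into non-zero values.
For this reason, scikit-learn's PCA has traditionally required a dense matrix (a regular table with all values) and rejected csr_matrix input.

Converting a large word frequency array to a dense matrix so that PCA can use it is slow and uses too much memory.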



4) How Does TruncatedSVD Solve This Problem?
-----------------------------------------------
TruncatedSVD (truncated Singular Value Decomposition) performs the same kind of transformation as PCA, but it does not center the data, so it works directly with sparse matrices.

It finds patterns in word frequency data.
It helps reduce the number of features while keeping important information.
On sparse data, it is faster and more memory-efficient than PCA.

    TruncatedSVD = PCA for sparse data


"""
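The word-frequency table from the notes above can be stored as a csr_matrix and reduced with TruncatedSVD. This is a minimal sketch; the choice of `n_components=2` and `random_state=0` is an assumption made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# The 3-document, 5-word frequency table from the notes
counts = np.array([
    [2, 1, 0, 0, 0],
    [0, 3, 0, 1, 0],
    [0, 0, 2, 4, 1],
])
sparse_counts = csr_matrix(counts)

# Only the non-zero values are stored ...
print(sparse_counts.data)     # [2 1 3 1 2 4 1]
# ... with the column index of each value ...
print(sparse_counts.indices)  # [0 1 1 3 2 3 4]
# ... and the position in data where each row starts (plus the end)
print(sparse_counts.indptr)   # [0 2 4 7]

# TruncatedSVD accepts the sparse matrix directly, with no dense conversion
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(sparse_counts)
print(reduced.shape)  # (3, 2)
```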

In [None]:
"""

Import PCA from sklearn.decomposition.
Create a PCA instance called pca with n_components=2.
Use the .fit() method of pca to fit it to the scaled fish measurements scaled_samples.
Use the .transform() method of pca to transform the scaled_samples. Assign the result to pca_features.

"""

# Import PCA
from sklearn.decomposition import PCA

# Create a PCA model with 2 components: pca
pca = PCA(n_components = 2)

# Fit the PCA instance to the scaled samples
pca.fit(scaled_samples)

# Transform the scaled samples: pca_features
pca_features = pca.transform(scaled_samples)

# Print the shape of pca_features
print(pca_features.shape)


In [None]:
### A tf-idf word-frequency array

"""

Import TfidfVectorizer from sklearn.feature_extraction.text.
Create a TfidfVectorizer instance called tfidf.
Apply .fit_transform() method of tfidf to documents and assign the result to csr_mat. This is a word-frequency array in csr_matrix format.
Inspect csr_mat by calling its .toarray() method and printing the result. This has been done for you.
The columns of the array correspond to words. Get the list of words by calling the .get_feature_names_out() method of tfidf (named .get_feature_names() before scikit-learn 1.0, and removed in 1.2), and assign the result to words.

"""

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()

# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)

# Print result of toarray() method
print(csr_mat.toarray())

# Get the words: words (converted from a NumPy array to a plain list)
words = list(tfidf.get_feature_names_out())

# Print words
print(words)



"""

[[0.51785612 0.         0.         0.68091856 0.51785612 0.        ]
 [0.         0.         0.51785612 0.         0.51785612 0.68091856]
 [0.51785612 0.68091856 0.51785612 0.         0.         0.        ]]
['cats', 'chase', 'dogs', 'meow', 'say', 'woof']

"""

In [None]:
### Clustering Wikipedia part I

"""

TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. Combine your knowledge of TruncatedSVD and k-means
to cluster some popular pages from Wikipedia. In this exercise, you will build the pipeline.
In the next exercise, you will apply it to the word-frequency array of some Wikipedia articles.

Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer).

"""


"""

Import:
TruncatedSVD from sklearn.decomposition.
KMeans from sklearn.cluster.
make_pipeline from sklearn.pipeline.

Create a TruncatedSVD instance called svd with n_components=50.
Create a KMeans instance called kmeans with n_clusters=6.
Create a pipeline called pipeline consisting of svd and kmeans

"""


# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components = 50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters = 6)

# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)


In [None]:
### Clustering Wikipedia part II

"""

It is now time to put your pipeline from the previous exercise to work! You are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles,
and a list titles of their titles. Use your pipeline to cluster the Wikipedia articles.

"""


"""

Import pandas as pd
Fit the pipeline to the word-frequency array articles.
Predict the cluster labels.
Align the cluster labels with the list titles of article titles by creating a DataFrame df with labels and titles as columns. This has been done for you.
Use the .sort_values() method of df to sort the DataFrame by the 'label' column, and print the result.
Take a moment to investigate the resulting clustering of Wikipedia pages.

"""



# Import pandas
import pandas as pd

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels: labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values('label'))



"""

label                                        article
59      0                                    Adam Levine
57      0                          Red Hot Chili Peppers
56      0                                       Skrillex
55      0                                  Black Sabbath
54      0                                 Arctic Monkeys
53      0                                   Stevie Nicks
52      0                                     The Wanted
51      0                                     Nate Ruess
50      0                                   Chad Kroeger
58      0                                         Sepsis
30      1                  France national football team
31      1                              Cristiano Ronaldo
32      1                                   Arsenal F.C.
33      1                                 Radamel Falcao
37      1                                       Football
35      1                Colombia national football team
36      1              2014 FIFA World Cup qualification
38      1                                         Neymar
39      1                                  Franck Ribéry
34      1                             Zlatan Ibrahimović
26      2                                     Mila Kunis
28      2                                  Anne Hathaway
27      2                                 Dakota Fanning
25      2                                  Russell Crowe
29      2                               Jennifer Aniston
23      2                           Catherine Zeta-Jones
22      2                              Denzel Washington
21      2                             Michael Fassbender
20      2                                 Angelina Jolie
24      2                                   Jessica Biel
10      3                                 Global warming
11      3       Nationally Appropriate Mitigation Action
13      3                               Connie Hedegaard
14      3                                 Climate change
12      3                                   Nigel Lawson
16      3                                        350.org
17      3  Greenhouse gas emissions by the United States
18      3  2010 United Nations Climate Change Conference
19      3  2007 United Nations Climate Change Conference
15      3                                 Kyoto Protocol
8       4                                        Firefox
1       4                                 Alexa Internet
2       4                              Internet Explorer
3       4                                    HTTP cookie
4       4                                  Google Search
5       4                                         Tumblr
6       4                    Hypertext Transfer Protocol
7       4                                  Social search
49      4                                       Lymphoma
42      4                                    Doxycycline
47      4                                          Fever
46      4                                     Prednisone
44      4                                           Gout
43      4                                       Leukemia
9       4                                       LinkedIn
48      4                                     Gabapentin
0       4                                       HTTP 404
45      5                                    Hepatitis C
41      5                                    Hepatitis B
40      5                                    Tonsillitis

"""
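The two Wikipedia exercises rely on a precomputed `articles` array that is not included here. As a stand-in, the sketch below runs the same kind of pipeline end to end on a tiny made-up corpus; the documents, `n_components=2`, `n_clusters=2`, `n_init=10`, and the random seeds are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus with two obvious topics
titles = ['pets 1', 'pets 2', 'pets 3', 'finance 1', 'finance 2', 'finance 3']
documents = [
    'cats say meow',
    'cats chase mice',
    'cats meow loudly',
    'stocks rise today',
    'stocks fall today',
    'markets rise stocks',
]

# Build the sparse tf-idf word-frequency array
articles = TfidfVectorizer().fit_transform(documents)

# Reduce the sparse array with TruncatedSVD, then cluster with k-means
svd = TruncatedSVD(n_components=2, random_state=0)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
pipeline = make_pipeline(svd, kmeans)

# Fit the pipeline and assign each document to a cluster
pipeline.fit(articles)
labels = pipeline.predict(articles)

# Show each title next to its cluster label, grouped by label
for label, title in sorted(zip(labels.tolist(), titles)):
    print(label, title)
```

The cluster numbers themselves are arbitrary; what matters is which documents end up sharing a label.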