### Introduction

#### Step-by-Step Process Before PCA

Below is a structured sequence of steps to follow before deciding whether PCA should be applied to your dataset. Each step also lists plots you can make at that point, along with what they can help you discover.

---

#### Step 1. **Understand the Data and Features**
- Carefully read through your dataset and understand what each column represents (e.g., GDP per capita, education expenditure, healthcare expenditure, unemployment rate, crime rate).
- Clarify whether each column is measured at a national level (macro indicators) or at a regional/city/state level.
- Check if features are measured in the same unit (like USD vs. INR, or percentages vs. absolute values).  

**Plots to make**  
- *Bar chart or pie chart*: to visualize the proportion of spending categories (like education vs. health) if data is in percentages.  
- *Line plots*: to see how indicators like GDP, education spending, healthcare spending have changed over time.  

---

#### Step 2. **Check Data Quality**
- Look for missing values, duplicated rows, inconsistent formatting.  
- For numerical columns, check if extreme outliers exist.  

**Plots to make**  
- *Histogram*: to view distribution of each feature and spot skewness.  
- *Boxplot*: to detect outliers for each feature.  
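
The checks in this step can be sketched in a few lines of pandas. This is a minimal sketch, assuming the 10-region table used later in this notebook (only three of its numeric columns are repeated here for brevity) and the common 1.5 × IQR rule for flagging outliers, which is a convention rather than a requirement.

```python
import pandas as pd

# Values copied from the first table used later in this notebook.
df = pd.DataFrame({
    "region": list("ABCDEFGHIJ"),
    "gdp": [25000, 22000, 28000, 20000, 30000, 26000, 24000, 27000, 23000, 29000],
    "unemploy_rate": [5, 7, 4, 9, 3, 6, 8, 5, 4, 2],
    "crime_rate": [10, 15, 8, 20, 5, 12, 18, 9, 14, 7],
})

# Missing values per column and number of fully duplicated rows.
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each numeric column.
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outliers_per_column = outlier_mask.sum()

print(missing_per_column, n_duplicates, outliers_per_column, sep="\n")
```

For this small table all three checks come back empty, so no cleaning is needed before moving on.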

---

#### Step 3. **Check Scale of Variables**
- PCA is very sensitive to scale because it finds the directions of largest variance, and variance depends on the units of measurement. A feature with very large values (like GDP in trillions) can dominate features with smaller values (like an unemployment rate in percent).
- For example, GDP per capita could be in thousands, education expenditure per capita in hundreds, and unemployment in single digits. If left unscaled, GDP will heavily dominate the PCA.  
- Therefore, **standardization (z-score scaling)** is almost always done before PCA (mean = 0, std = 1 for each feature).  
- If some features are ratios/percentages, they may already be comparable with each other, but it’s still best practice to standardize all features together.  

**Plots to make**  
- *Density plots (before and after scaling)*: to confirm features are now on comparable scales.  
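
To make the dominance argument concrete, the sketch below computes each feature's share of the total variance before and after z-scoring. It assumes the 10-region table used later in this notebook and uses the sample standard deviation (`ddof=1`), which is what pandas uses by default.

```python
import numpy as np

names = ["gdp", "unemploy_rate", "ed_expend", "health_expend", "crime_rate"]
# Values copied from the first table used later in this notebook.
X = np.array([
    [25000, 5, 1500, 2000, 10],
    [22000, 7, 1200, 1800, 15],
    [28000, 4, 1800, 2200, 8],
    [20000, 9, 1000, 1500, 20],
    [30000, 3, 2000, 2500, 5],
    [26000, 6, 1600, 2100, 12],
    [24000, 8, 1300, 1900, 18],
    [27000, 5, 1700, 2300, 9],
    [23000, 4, 1400, 1700, 14],
    [29000, 2, 1900, 2400, 7],
], dtype=float)

# Share of the total variance carried by each raw feature.
raw_var = X.var(axis=0, ddof=1)
raw_share = raw_var / raw_var.sum()

# z-score each column, then recompute the shares.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
z_var = Z.var(axis=0, ddof=1)
z_share = z_var / z_var.sum()

for name, before, after in zip(names, raw_share, z_share):
    print(f"{name:14s} raw share = {before:.4f}   standardized share = {after:.4f}")
```

On the raw data, GDP per capita carries roughly 98% of the total variance, so an unscaled PCA would essentially return GDP as its first component. After standardization every feature contributes an equal share.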

---

#### Step 4. **Check Multicollinearity Between Features**
- PCA shines when features are correlated/redundant. If most features are uncorrelated with each other, PCA won’t add much value, because each component will roughly reproduce one original feature.  
- Compute the correlation matrix. Look for groups of features with high correlation (positive or negative).  
- Example: Education expenditure per capita and healthcare expenditure per capita may move together because both are parts of government spending.  

**Plots to make**  
- *Heatmap of correlation matrix*: to visualize correlation patterns among features.  
- *Pairplot (scatterplot matrix)*: to visually see relationships between features.  
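
A minimal sketch of the correlation check, assuming the 10-region table used later in this notebook. The cut-off of 0.8 for "high" correlation is an illustrative assumption, not a standard value.

```python
import numpy as np

names = ["gdp", "unemploy_rate", "ed_expend", "health_expend", "crime_rate"]
# Values copied from the first table used later in this notebook.
X = np.array([
    [25000, 5, 1500, 2000, 10],
    [22000, 7, 1200, 1800, 15],
    [28000, 4, 1800, 2200, 8],
    [20000, 9, 1000, 1500, 20],
    [30000, 3, 2000, 2500, 5],
    [26000, 6, 1600, 2100, 12],
    [24000, 8, 1300, 1900, 18],
    [27000, 5, 1700, 2300, 9],
    [23000, 4, 1400, 1700, 14],
    [29000, 2, 1900, 2400, 7],
], dtype=float)

# Pearson correlation matrix; rowvar=False treats columns as variables.
R = np.corrcoef(X, rowvar=False)

# List feature pairs whose absolute correlation reaches the (assumed) threshold.
threshold = 0.8
pairs = [
    (names[i], names[j], R[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(R[i, j]) >= threshold
]

for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:+.3f}")
```

The matrix `R` is also what a heatmap would display, for example via `sns.heatmap(R, annot=True)`.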

---

#### Step 5. **Suitability for PCA**
- Not all datasets are appropriate for PCA.  
- Statistical checks:  
  - **KMO (Kaiser-Meyer-Olkin) test** → measures sampling adequacy on a scale from 0 to 1. Values above about 0.6 are commonly taken to mean PCA is likely useful.  
  - **Bartlett’s test of sphericity** → checks whether the correlation matrix is significantly different from the identity matrix. A significant result (small p-value) means the features are correlated enough for PCA to make sense.  
- These tests basically confirm that your dataset has redundancy/overlap and that PCA can meaningfully reduce dimensions.  

**Plots to make**  
- *Scree plot*: to check eigenvalues (important later, but useful for suitability too).  
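
Both statistics can be computed directly from the correlation matrix. The sketch below uses the textbook formulas with NumPy and `scipy.stats` (SciPy is an added dependency, not imported elsewhere in this notebook) on the 10-region table used later. With only 10 rows the results should be read as a rough indication rather than a firm verdict.

```python
import numpy as np
from scipy import stats

# Values copied from the first table used later in this notebook.
# Columns: gdp, unemploy_rate, ed_expend, health_expend, crime_rate
X = np.array([
    [25000, 5, 1500, 2000, 10],
    [22000, 7, 1200, 1800, 15],
    [28000, 4, 1800, 2200, 8],
    [20000, 9, 1000, 1500, 20],
    [30000, 3, 2000, 2500, 5],
    [26000, 6, 1600, 2100, 12],
    [24000, 8, 1300, 1900, 18],
    [27000, 5, 1700, 2300, 9],
    [23000, 4, 1400, 1700, 14],
    [29000, 2, 1900, 2400, 7],
], dtype=float)

n, p = X.shape
R = np.corrcoef(X, rowvar=False)

# Bartlett's test of sphericity:
#   chi2 = -(n - 1 - (2p + 5) / 6) * ln|R|,  with p(p - 1)/2 degrees of freedom.
_, logdet = np.linalg.slogdet(R)  # log-determinant, numerically safer than log(det)
chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
dof = p * (p - 1) // 2
p_value = stats.chi2.sf(chi2, dof)

# Overall KMO: squared correlations relative to squared correlations plus
# squared partial correlations, using off-diagonal entries only.
inv_R = np.linalg.inv(R)
scale = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
partial = -inv_R / scale
off_diag = ~np.eye(p, dtype=bool)
r2 = np.sum(R[off_diag] ** 2)
a2 = np.sum(partial[off_diag] ** 2)
kmo = r2 / (r2 + a2)

print(f"Bartlett chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
print(f"KMO = {kmo:.3f}")
```

Inverting `R` can be unstable when features are almost perfectly correlated, so a very small determinant is itself a sign of strong redundancy.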

---

#### Step 6. **Decide Your Goal**
- Think clearly why you want PCA.  
  - Do you want to reduce the dataset from 6–7 indicators into 2–3 combined indices (like “economic strength index”, “social well-being index”)?  
  - Or do you want to use PCA just as preprocessing before clustering/ML?  
- Having this goal in mind helps you interpret the principal components later.  

---

#### Step 7. **Run PCA and Analyze Variance Explained**
- When you run PCA, check the eigenvalues and variance explained by each component.  
- Typically, the first few components explain most of the variance when the features are strongly correlated.  
- Use a **scree plot** (elbow method) to decide how many PCs to retain.  

**Plots to make**  
- *Scree plot*: shows how much variance each component explains.  
- *Cumulative variance plot*: to see how many components together capture 80–90% of variance.  
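
The quantities behind the scree and cumulative variance plots can be obtained from an eigendecomposition of the correlation matrix. This is a sketch using plain NumPy on the 10-region table used later in this notebook. The 90% target is an illustrative assumption taken from the 80–90% range mentioned above.

```python
import numpy as np

# Values copied from the first table used later in this notebook.
# Columns: gdp, unemploy_rate, ed_expend, health_expend, crime_rate
X = np.array([
    [25000, 5, 1500, 2000, 10],
    [22000, 7, 1200, 1800, 15],
    [28000, 4, 1800, 2200, 8],
    [20000, 9, 1000, 1500, 20],
    [30000, 3, 2000, 2500, 5],
    [26000, 6, 1600, 2100, 12],
    [24000, 8, 1300, 1900, 18],
    [27000, 5, 1700, 2300, 9],
    [23000, 4, 1400, 1700, 14],
    [29000, 2, 1900, 2400, 7],
], dtype=float)

# PCA on standardized data is an eigendecomposition of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)  # eigh: for symmetric matrices, ascending order

# Sort components from largest to smallest eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()  # heights of the scree plot
cumulative = np.cumsum(explained_ratio)    # cumulative variance curve

# Smallest number of components whose cumulative share reaches the target.
target = 0.90
n_keep = int(np.searchsorted(cumulative, target)) + 1

for k, (ratio, cum) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{k}: explained = {ratio:.3f}, cumulative = {cum:.3f}")
print("components to keep:", n_keep)
```

Plotting `explained_ratio` against the component number gives the scree plot, and plotting `cumulative` gives the cumulative variance plot.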

---

#### Step 8. **Interpret the Components**
- Look at the loadings (coefficients of each original feature in the principal components).  
- Try to interpret them meaningfully. Example:  
  - PC1 may load heavily on GDP, education, and healthcare → representing “economic development”.  
  - PC2 may load heavily on unemployment and crime → representing “social stress”.  
- These interpretations help you understand underlying dimensions that drive the dataset.  

**Plots to make**  
- *Biplot (PC1 vs PC2 with loadings)*: shows both data points and how features contribute.  
- *Component loading heatmap*: to see which variables strongly influence each component.  
- *Scatter plot of first two PCs*: to observe clusters or patterns in reduced space.  
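
Loadings and component scores follow from the same eigendecomposition. The sketch below assumes the 10-region table used later in this notebook and defines loadings as eigenvectors scaled by the square root of their eigenvalues, so that each loading equals the correlation between a feature and a component. Some libraries report the unscaled eigenvectors instead, so check which convention a given tool uses.

```python
import numpy as np

names = ["gdp", "unemploy_rate", "ed_expend", "health_expend", "crime_rate"]
# Values copied from the first table used later in this notebook.
X = np.array([
    [25000, 5, 1500, 2000, 10],
    [22000, 7, 1200, 1800, 15],
    [28000, 4, 1800, 2200, 8],
    [20000, 9, 1000, 1500, 20],
    [30000, 3, 2000, 2500, 5],
    [26000, 6, 1600, 2100, 12],
    [24000, 8, 1300, 1900, 18],
    [27000, 5, 1700, 2300, 9],
    [23000, 4, 1400, 1700, 14],
    [29000, 2, 1900, 2400, 7],
], dtype=float)

# Standardize with the sample standard deviation (ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the correlation matrix, sorted by decreasing eigenvalue.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative round-off
eigvecs = eigvecs[:, order]

# Loadings: one row per feature, one column per component.
loadings = eigvecs * np.sqrt(eigvals)

# Scores: coordinates of each region in component space.
scores = Z @ eigvecs

for name, row in zip(names, loadings):
    print(f"{name:14s} PC1 = {row[0]:+.3f}   PC2 = {row[1]:+.3f}")
```

The sign of each component is arbitrary, so a column of `loadings` (and the matching column of `scores`) can be flipped without changing the analysis. The first two columns of `scores` are the coordinates for a PC1 vs PC2 scatter plot, and `loadings` is the matrix a loading heatmap would show.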

---

#### Summary
The general workflow before PCA:  
1. Understand dataset and units  
2. Check data quality  
3. Standardize/scale features  
4. Check correlations (redundancy)  
5. Test suitability for PCA (KMO, Bartlett)  
6. Clarify your analysis goal  
7. Run PCA and study variance explained  
8. Interpret principal components  

With these steps and the suggested plots, you’ll be able to both validate whether PCA is appropriate and also extract meaningful insights.


In [1]:
#import libraries
from docx import Document
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

I have a Word file with 3 dataset tables.<br>
I will extract those tables using the <code><b>docx</b></code> library.

In [2]:
#load the data
doc = Document('data/pca_datasets.docx')

##### Step 1: Extract Table from MS-Word and create pandas df

In [3]:
#taking the first table
table1 = doc.tables[0]
#Dictionary to store each table-column as one entry 
data_table={}

In [4]:
#Create a dictionary with the column header as key and the remaining cells in that column as a list of values.
#Cell text is read as strings, e.g. {"Region": ['A', 'B', ...], "GDP": ['5.3', '2.2', ...]}
for column in table1.columns:
    col = [cell.text.strip().replace('\n', ' ') for cell in column.cells]
    data_table[col[0]] = col[1:]

In [5]:
df = pd.DataFrame(data_table)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column                                   Non-Null Count  Dtype 
---  ------                                   --------------  ----- 
 0   Region                                   10 non-null     object
 1   GDP per capita (USD)                     10 non-null     object
 2   Unemployment Rate (%)                    10 non-null     object
 3   Education Expenditure per capita (USD)   10 non-null     object
 4   Healthcare Expenditure per capita (USD)  10 non-null     object
 5   Crime Rate (per 1000 inhabitants)        10 non-null     object
dtypes: object(6)
memory usage: 612.0+ bytes


In the code below, the column names are simplified for easier handling.  
The original and new names are listed in the table below:
<table style="border:2px solid black;">
    <tr style="border:1px solid black;">
        <th style="border:1px solid black;">Original colname</th>
        <th style="border:1px solid black;">New colname</th>
    </tr>
    <tr style="border:1px solid black;">
        <td style="border:1px solid black;">Region</td>
        <td style="border:1px solid black;">region</td>
    </tr>
    <tr style="border:1px solid black;">
        <td style="border:1px solid black;">GDP per capita (USD)</td>
        <td style="border:1px solid black;">gdp</td>
    </tr>
    <tr style="border:1px solid black;">
        <td style="border:1px solid black;">Unemployment Rate (%)</td>
        <td style="border:1px solid black;">unemploy_rate</td>
    </tr>
    <tr style="border:1px solid black;">
        <td style="border:1px solid black;">Education Expenditure per capita (USD)</td>
        <td style="border:1px solid black;">ed_expend</td>
    </tr>
    <tr style="border:1px solid black;">
        <td style="border:1px solid black;">Healthcare Expenditure per capita (USD)</td>
        <td style="border:1px solid black;">health_expend</td>
    </tr>
    <tr style="border:1px solid black;">
        <td style="border:1px solid black;">Crime Rate (per 1000 inhabitants)</td>
        <td style="border:1px solid black;">crime_rate</td>
    </tr>
</table>

##### Step 2: Clean dataframe and view feature distributions

In [7]:
#rename columns
df.rename(columns={"Region":"region", "GDP per capita (USD)":"gdp", 
                   "Unemployment Rate (%)":"unemploy_rate", 
                   "Education Expenditure per capita (USD)":"ed_expend",
                  "Healthcare Expenditure per capita (USD)":"health_expend",
                  "Crime Rate (per 1000 inhabitants)":"crime_rate"},
         inplace=True)

In [8]:
#View the dataframe
df

Unnamed: 0,region,gdp,unemploy_rate,ed_expend,health_expend,crime_rate
0,A,25000,5,1500,2000,10
1,B,22000,7,1200,1800,15
2,C,28000,4,1800,2200,8
3,D,20000,9,1000,1500,20
4,E,30000,3,2000,2500,5
5,F,26000,6,1600,2100,12
6,G,24000,8,1300,1900,18
7,H,27000,5,1700,2300,9
8,I,23000,4,1400,1700,14
9,J,29000,2,1900,2400,7


In [10]:
#Fix data type of numeric columns
num_cols = ['gdp', 'unemploy_rate', 'ed_expend', 'health_expend', 'crime_rate']
df[num_cols] = df[num_cols].astype(float)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   region         10 non-null     object 
 1   gdp            10 non-null     float64
 2   unemploy_rate  10 non-null     float64
 3   ed_expend      10 non-null     float64
 4   health_expend  10 non-null     float64
 5   crime_rate     10 non-null     float64
dtypes: float64(5), object(1)
memory usage: 612.0+ bytes
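
`astype(float)` works here because every cell is a clean number, but it raises an error as soon as one cell contains a thousands separator or a placeholder such as "n/a". A more tolerant conversion can be sketched with `pd.to_numeric`. The messy cells below are hypothetical and are not taken from the table in this notebook.

```python
import pandas as pd

# Hypothetical messy cells, NOT from the notebook's table.
raw = pd.Series(["25000", "22,000", " 28000 ", "n/a"], name="gdp")

# Strip whitespace and thousands separators, then convert.
cleaned = raw.str.strip().str.replace(",", "", regex=False)
gdp = pd.to_numeric(cleaned, errors="coerce")  # unparseable cells become NaN

n_bad = int(gdp.isna().sum())
print(gdp)
print("cells that could not be parsed:", n_bad)
```

Cells that end up as NaN then have to be handled explicitly (dropped, corrected at the source, or imputed) before PCA.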


##### Step 3: Standardizing features (Fix scale of all features)

<div style="background-color:#FFE6E6; color:red;padding:5px 5px;margin:0px">
<i>What does mean=0, std=1 actually do?</i><br>
<i>How does it affect the arrangement of the data points?</i><br>
[Find out the difference to build visual intuition]
</div>
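
One way to approach these questions numerically: standardization shifts and rescales each feature separately, so correlations and the ordering of regions within a feature stay the same, while distances between regions change. The sketch below checks this on the five numeric columns of the table above, using the sample standard deviation (`ddof=1`) as pandas does.

```python
import numpy as np

# Columns: gdp, unemploy_rate, ed_expend, health_expend, crime_rate
X = np.array([
    [25000, 5, 1500, 2000, 10],
    [22000, 7, 1200, 1800, 15],
    [28000, 4, 1800, 2200, 8],
    [20000, 9, 1000, 1500, 20],
    [30000, 3, 2000, 2500, 5],
    [26000, 6, 1600, 2100, 12],
    [24000, 8, 1300, 1900, 18],
    [27000, 5, 1700, 2300, 9],
    [23000, 4, 1400, 1700, 14],
    [29000, 2, 1900, 2400, 7],
], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 1) Correlations between features are unchanged.
same_corr = np.allclose(np.corrcoef(X, rowvar=False), np.corrcoef(Z, rowvar=False))

# 2) The ranking of regions within each feature is unchanged.
same_order = np.array_equal(
    np.argsort(X, axis=0, kind="stable"), np.argsort(Z, axis=0, kind="stable")
)

# 3) Distances change: share of the squared distance between regions A and B
#    that comes from the gdp column, before and after standardization.
raw_diff = X[0] - X[1]
std_diff = Z[0] - Z[1]
raw_gdp_share = raw_diff[0] ** 2 / np.sum(raw_diff ** 2)
std_gdp_share = std_diff[0] ** 2 / np.sum(std_diff ** 2)

print(same_corr, same_order)
print(f"gdp share of squared distance A-B: raw = {raw_gdp_share:.3f}, standardized = {std_gdp_share:.3f}")
```

In raw units the distance between regions A and B is almost entirely a GDP difference. After standardization all five features contribute on a comparable footing.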

In [14]:
df2 = df.drop("region", axis=1)

In [17]:
#Standardize: all features have mean=0, standard dev=1 (pandas .std() uses the sample std, ddof=1)
df2 = (df2 - df2.mean())/df2.std()

In [18]:
df2

Unnamed: 0,gdp,unemploy_rate,ed_expend,health_expend,crime_rate
0,-0.124838,-0.135526,-0.124838,-0.124838,-0.367764
1,-1.061119,0.767982,-1.061119,-0.749025,0.653803
2,0.811444,-0.58728,0.811444,0.49935,-0.776391
3,-1.685307,1.67149,-1.685307,-1.685307,1.67537
4,1.435632,-1.039034,1.435632,1.435632,-1.389331
5,0.187256,0.316228,0.187256,0.187256,0.040863
6,-0.436931,1.219736,-0.749025,-0.436931,1.266743
7,0.49935,-0.135526,0.49935,0.811444,-0.572078
8,-0.749025,-0.58728,-0.436931,-1.061119,0.44949
9,1.123538,-1.490788,1.123538,1.123538,-0.980704
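
A quick sanity check on the standardized table is sketched below for two of the columns. It also shows the effect of the `ddof` choice: pandas divides by the sample standard deviation (`ddof=1`), whereas tools that use the population standard deviation (`ddof=0`, for example scikit-learn's `StandardScaler`) give values larger by a factor of sqrt(n / (n - 1)). This does not change the principal directions, only the scale of the scores.

```python
import numpy as np
import pandas as pd

# Two columns from the table above, enough to illustrate the check.
df2 = pd.DataFrame({
    "gdp": [25000, 22000, 28000, 20000, 30000, 26000, 24000, 27000, 23000, 29000],
    "unemploy_rate": [5, 7, 4, 9, 3, 6, 8, 5, 4, 2],
}, dtype=float)

z_sample = (df2 - df2.mean()) / df2.std()        # pandas default: ddof=1
z_pop = (df2 - df2.mean()) / df2.std(ddof=0)     # population standard deviation

n = len(df2)
print(z_sample.mean().round(12))   # each column mean should be ~0
print(z_sample.std())              # each column std should be 1
print("scale factor between the two conventions:", np.sqrt(n / (n - 1)))
```

The first row of `z_sample` reproduces the values shown in the output above (about -0.124838 for `gdp` and -0.135526 for `unemploy_rate`).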
