# **Principal Component Analysis (PCA)**

PCA stands for Principal Component Analysis, a dimensionality reduction technique widely used in data analysis and machine learning. 

PCA reduces the complexity of a dataset while retaining the directions that carry most of its variance.

By discarding the low-variance components, PCA leaves fewer features to process, which allows our models to run faster.

It works by transforming the data into a new coordinate system, where the greatest variance lies on the first axis, the second greatest variance on the second axis, and so on. This transformation helps in visualizing and understanding the underlying structure of the data.
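The change of coordinates described above can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the notebook's baseball dataset; the array `X` and the mixing matrix are assumptions made up for the example. It centers the data, uses the singular value decomposition to find the new axes, and checks that the variance along them comes out in descending order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data stretched along one direction (illustrative only)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

# 1. Center each feature so the new axes pass through the data mean
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data: the rows of Vt are the principal axes,
#    ordered by singular value (largest first)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Express every sample in the new coordinate system
scores = X_centered @ Vt.T

# Sample variance along each new axis; the first axis carries the most
variances = scores.var(axis=0, ddof=1)
print(variances)
```

The variances printed here equal `S**2 / (n_samples - 1)`, which is why sorting by singular value is the same as sorting by variance.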

It transforms a dataset with many variables into a smaller set of uncorrelated variables called principal components, which capture the most important variations in the data. This helps in data compression, visualization, and feature extraction.
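As a rough sketch of this reduction, the snippet below uses scikit-learn's `PCA` on synthetic data (six features generated from two hidden factors, an assumption chosen only for illustration). It standardizes the features, keeps two components, and checks that the resulting components are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: 6 observed features driven by 2 hidden factors plus noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize so that no feature dominates because of its scale
X_scaled = StandardScaler().fit_transform(X)

# Keep only the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)  # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component

# Off-diagonal entries are numerically zero: the components are uncorrelated
corr = np.corrcoef(components, rowvar=False)
print(np.round(corr, 6))
```

The sum of `explained_variance_ratio_` gives a quick measure of how much information survives the compression from six features to two.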

It is particularly useful when dealing with high-dimensional data, as it allows for more efficient processing and analysis.

PCA is widely used in various fields, including image processing, finance, and bioinformatics, to simplify data analysis and improve model performance.

It is important to note that PCA is a linear technique, meaning it assumes that the relationships between features are linear. For datasets with non-linear relationships, other techniques like t-SNE or UMAP may be more appropriate.
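To illustrate the limitation, here is a small sketch on two concentric circles, a synthetic dataset from scikit-learn's `make_circles` (the sample size, noise level, and perplexity are arbitrary choices for the example). Because the structure is radial rather than linear, PCA finds no dominant axis, while t-SNE from `sklearn.manifold` can still be used to produce a non-linear 2-D embedding.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Two concentric circles: the class structure depends on the radius,
# which no single linear direction can capture
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# PCA: both components explain roughly half of the variance,
# so dropping either one discards about half of the information
pca = PCA(n_components=2).fit(X)
ratio = pca.explained_variance_ratio_
print(ratio)

# t-SNE: a non-linear embedding of the same points
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (300, 2)
```

Plotting `embedding` colored by `y` is the usual way to inspect whether the non-linear method separates the two rings.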

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
tables = pd.read_html("https://www.baseball-reference.com/leagues/majors/2022.shtml", flavor="lxml")

# Save the first table (Team Standard Batting) to CSV
tables[0].to_csv("mlb_batting_2022.csv", index=False)
print("✅ CSV saved successfully!")

✅ CSV saved successfully!


In [17]:
df = pd.read_csv('mlb_batting_2022.csv')

display(df.head())

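| | Tm | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | ... | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arizona Diamondbacks | 57 | 26.5 | 4.33 | 162 | 6027 | 5351 | 702 | 1232 | 262 | ... | 0.385 | 0.689 | 95 | 2061 | 97 | 60 | 31 | 50 | 14 | 1039 |
| 1 | Atlanta Braves | 53 | 27.5 | 4.87 | 162 | 6082 | 5509 | 789 | 1394 | 298 | ... | 0.443 | 0.761 | 109 | 2443 | 103 | 66 | 1 | 36 | 13 | 1030 |
| 2 | Baltimore Orioles | 58 | 27.0 | 4.16 | 162 | 6049 | 5429 | 674 | 1281 | 275 | ... | 0.39 | 0.695 | 99 | 2119 | 95 | 83 | 12 | 43 | 10 | 1095 |
| 3 | Boston Red Sox | 54 | 28.8 | 4.54 | 162 | 6144 | 5539 | 735 | 1427 | 352 | ... | 0.409 | 0.731 | 102 | 2268 | 131 | 63 | 12 | 50 | 23 | 1133 |
| 4 | Chicago Cubs | 64 | 27.9 | 4.06 | 162 | 6072 | 5425 | 657 | 1293 | 265 | ... | 0.387 | 0.698 | 94 | 2097 | 130 | 84 | 19 | 36 | 16 | 1100 |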


In [19]:
# Drop the team-name column: PCA works on numeric features only
df = df.drop(columns=['Tm'])

In [20]:
df.head()

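| | #Bat | BatAge | R/G | G | PA | AB | R | H | 2B | 3B | ... | SLG | OPS | OPS+ | TB | GDP | HBP | SH | SF | IBB | LOB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57 | 26.5 | 4.33 | 162 | 6027 | 5351 | 702 | 1232 | 262 | 24 | ... | 0.385 | 0.689 | 95 | 2061 | 97 | 60 | 31 | 50 | 14 | 1039 |
| 1 | 53 | 27.5 | 4.87 | 162 | 6082 | 5509 | 789 | 1394 | 298 | 11 | ... | 0.443 | 0.761 | 109 | 2443 | 103 | 66 | 1 | 36 | 13 | 1030 |
| 2 | 58 | 27.0 | 4.16 | 162 | 6049 | 5429 | 674 | 1281 | 275 | 25 | ... | 0.39 | 0.695 | 99 | 2119 | 95 | 83 | 12 | 43 | 10 | 1095 |
| 3 | 54 | 28.8 | 4.54 | 162 | 6144 | 5539 | 735 | 1427 | 352 | 12 | ... | 0.409 | 0.731 | 102 | 2268 | 131 | 63 | 12 | 50 | 23 | 1133 |
| 4 | 64 | 27.9 | 4.06 | 162 | 6072 | 5425 | 657 | 1293 | 265 | 31 | ... | 0.387 | 0.698 | 94 | 2097 | 130 | 84 | 19 | 36 | 16 | 1100 |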


In [24]:
df.columns

Index(['#Bat', 'BatAge', 'R/G', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR',
       'RBI', 'SB', 'CS', 'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB',
       'GDP', 'HBP', 'SH', 'SF', 'IBB', 'LOB'],
      dtype='object')