1. PCA reduces the dimensionality of the data by projecting it onto fewer dimensions, which makes it easier to see relationships and identify possible patterns in the data.
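
As a rough illustration of what "projecting onto fewer dimensions" means, the sketch below performs the projection by hand with NumPy's SVD instead of sklearn's PCA class. The data is synthetic and made up for this example: three columns, where the third is almost the sum of the first two, so two directions should carry nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the third feature is almost a linear combination of the first two.
a = rng.normal(size=200)
b = rng.normal(size=200)
data = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=200)])

# Center the data, then take the SVD; the rows of vt are the principal axes.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance carried by each principal axis.
variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Project onto the first two axes: 3 columns become 2.
projected = centered @ vt[:2].T

print(projected.shape)
print(variance_ratio.round(4))
```

The last entry of `variance_ratio` should be close to zero, which is the sense in which the third dimension can be dropped with little loss.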

In [32]:
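import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler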

2. Number of principal components needed to explain 90% of the variance in sensors1.csv

In [34]:
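sensors1 = pd.read_csv('sensors1.csv')
# Fill missing values with the column means
sensors1 = sensors1.fillna(sensors1.mean())

# Fit PCA with all components to get the full variance spectrum
pca = PCA()
pca.fit(sensors1)

# Cumulative share of variance explained by the first k components
explained_variance = np.cumsum(pca.explained_variance_ratio_)

# First k whose cumulative share reaches 90% (argmax returns the first True)
n_components = np.argmax(explained_variance >= 0.90) + 1
print(f'Number of components needed: {n_components}')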

Number of components needed: 3
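
The threshold step can be checked on a small example. The sketch below uses made-up variance ratios (not the values from sensors1.csv) to show how `np.cumsum` and `np.argmax` pick the component count. It then compares the manual count with sklearn's built-in shortcut, where a float `n_components` between 0 and 1 combined with `svd_solver='full'` asks PCA to keep enough components to explain that share of the variance. The `toy` array is synthetic and only there for the comparison.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up variance ratios, for illustration only.
ratios = np.array([0.58, 0.26, 0.15, 0.01])
cumulative = np.cumsum(ratios)            # running total of explained variance
reached = cumulative >= 0.90              # [False, False, True, True]
n_manual = int(np.argmax(reached)) + 1    # argmax gives the index of the first True
print(n_manual)                           # prints 3

# Same selection done by sklearn on synthetic data (an assumption):
# five independent columns with very different scales.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.2, 0.1])

pca_auto = PCA(n_components=0.90, svd_solver='full').fit(toy)
toy_cumulative = np.cumsum(PCA().fit(toy).explained_variance_ratio_)
n_toy = int(np.argmax(toy_cumulative >= 0.90)) + 1

print(pca_auto.n_components_, n_toy)
```

The two counts should agree except in the edge case where the cumulative sum lands exactly on the threshold.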


3. Constructing the linear model

In [35]:
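# The last column is used as the target; the remaining columns are the predictors
y = sensors1.iloc[:, -1]
X = sensors1.iloc[:, :-1]

# Project the predictors onto the first n_components principal components
pca_n = PCA(n_components=n_components)
principal_components = pca_n.fit_transform(X)
pc_df = pd.DataFrame(principal_components)

# Fit ordinary least squares on the principal components
model = LinearRegression()
model.fit(pc_df, y)

# Note: R² is measured on the same data the model was fitted on
y_pred = model.predict(pc_df)
r2 = r2_score(y, y_pred)

print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
print('R²:', r2)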

Intercept: 761.0570353035304
Coefficients: [ 0.6253928   0.28164343 -0.3615673 ]
R²: 0.9960376968428102
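
The R² above is computed on the training data, so it says little about how the model would do on unseen readings. One possible way to get an out-of-sample estimate is a cross-validated pipeline, sketched below. This is not part of the original solution. The data is a synthetic stand-in (`X_demo`, `y_demo`, `latent` and `mixing` are hypothetical names), and the scaling step is an added assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the sensor data (an assumption, not sensors1.csv):
# 8 observed columns driven by 3 hidden signals, plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 8))
y_demo = latent @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=200)

# Scaling, PCA and regression are refit inside every training fold,
# so the held-out fold never influences the projection.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='r2')

print('R² per fold:', scores.round(4))
print('Mean R²:', scores.mean().round(4))
```

Swapping `X_demo` and `y_demo` for the real `X` and `y` would give a held-out R² to compare with the training value.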


4. Help Krystor determine whether sensors2.csv also contains the output of the sensors in the electric circuit of Reapor Rich’s new cinema.

In [36]:
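# Load the second dataset and keep only the columns present in both files
sensors2 = pd.read_csv('sensors2.csv')
common_columns = sensors1.columns.intersection(sensors2.columns)
sensors1 = sensors1[common_columns]
sensors2 = sensors2[common_columns]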

# Preprocess the data: handle missing values
sensors1 = sensors1.fillna(sensors1.mean())
sensors2 = sensors2.fillna(sensors2.mean())

# Standardize the data separately
scaler1 = StandardScaler()
scaler2 = StandardScaler()
sensors1_scaled = scaler1.fit_transform(sensors1)
sensors2_scaled = scaler2.fit_transform(sensors2)

# Perform PCA on both datasets independently
pca1 = PCA()
pca2 = PCA()
sensors1_pca = pca1.fit_transform(sensors1_scaled)
sensors2_pca = pca2.fit_transform(sensors2_scaled)

# Get the explained variance ratio for the first few components
explained_variance_ratio1 = pca1.explained_variance_ratio_
explained_variance_ratio2 = pca2.explained_variance_ratio_

# Calculate the correlation of explained variance ratios
min_components = min(len(explained_variance_ratio1), len(explained_variance_ratio2))
correlation = np.corrcoef(explained_variance_ratio1[:min_components], explained_variance_ratio2[:min_components])[0, 1]

# Output the results
results = pd.DataFrame({
    'Component': range(1, min_components + 1),
    'Explained Variance Ratio (Sensors1)': explained_variance_ratio1[:min_components],
    'Explained Variance Ratio (Sensors2)': explained_variance_ratio2[:min_components],
    'Correlation': [correlation] * min_components
})

print(results)

    Component  Explained Variance Ratio (Sensors1)  \
0           1                             0.582098   
1           2                             0.261367   
2           3                             0.146835   
3           4                             0.002251   
4           5                             0.001864   
5           6                             0.001279   
6           7                             0.000965   
7           8                             0.000846   
8           9                             0.000712   
9          10                             0.000563   
10         11                             0.000487   
11         12                             0.000414   
12         13                             0.000318   

    Explained Variance Ratio (Sensors2)  Correlation  
0                              0.581767     0.997839  
1                              0.288673     0.997839  
2                              0.118173     0.997839  
3                      

The explained variance ratios of the two datasets are highly correlated (about 0.998), so sensors2 most likely also contains the output of the sensors in the electric circuit of Reapor Rich’s new cinema.
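
Explained variance ratios tend to decay in a similar way for many datasets, so a complementary check may be worth considering: compare the directions of the leading principal axes, not only their variance shares. The sketch below is a different technique from the one used above. It computes the cosines of the principal angles between the two top-k PCA subspaces, using NumPy only. All data is synthetic, the helper names (`top_axes`, `subspace_cosines`, `simulate`) are hypothetical, and the data is centered but not standardized, unlike the cell above.

```python
import numpy as np

def top_axes(data, k):
    """Return the first k principal axes (as rows) of the centered data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def subspace_cosines(data_a, data_b, k):
    """Cosines of the principal angles between the two top-k PCA subspaces."""
    overlap = top_axes(data_a, k) @ top_axes(data_b, k).T
    return np.linalg.svd(overlap, compute_uv=False)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 6))        # shared structure (made up)
other_mixing = rng.normal(size=(3, 6))  # a different structure (made up)

def simulate(mix, n=1000):
    """Draw n samples: 3 hidden signals mixed into 6 columns, plus small noise."""
    return rng.normal(size=(n, 3)) @ mix + 0.01 * rng.normal(size=(n, 6))

same = subspace_cosines(simulate(mixing), simulate(mixing), k=3)
different = subspace_cosines(simulate(mixing), simulate(other_mixing), k=3)

print('same source:     ', same.round(3))
print('different source:', different.round(3))
```

Cosines all close to 1 indicate that both datasets vary along the same directions. Noticeably smaller values suggest a different underlying structure even when the variance ratios look alike.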