# PCA

PCA (Principal Component Analysis) is an unsupervised procedure for reducing the number of attributes. The new attributes are called principal components. PCA itself is not a classification algorithm, but other algorithms can be applied to its output. Why do we want to reduce the number of attributes?
    - we gain speed
    - we avoid the problems that arise with too many attributes: distances in high-dimensional spaces become nearly meaningless, because it is hard to tell what is far from what, and by which criterion
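
The second point can be illustrated with a small experiment. The sketch below is not part of the original notebook: it assumes uniformly random points (chosen only for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The helper name `relative_contrast` is hypothetical.

```python
import numpy as np

def relative_contrast(n_dims, n_points=500, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimension grows: near and far points look alike.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

As the contrast approaches zero, "nearest" and "farthest" stop being informative, which is one reason distance-based methods tend to benefit from dimensionality reduction.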

Suppose we have 100 attributes; most of them are probably not that important. For example, say the first 2 components carry 70% of the information, 3 carry 85%, 4 carry 90%, and so on. PCA lets us find the point at which to "cut".

The idea is to substantially reduce the number of attributes with minimal loss of information.
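
The "cut" can be sketched as a search over cumulative variance shares. The per-component numbers below are hypothetical, chosen only so that the cumulative sums match the example in the text (2 components give 70%, 3 give 85%, 4 give 90%); the helper `components_needed` is likewise an illustrative name, not a library function.

```python
import numpy as np

# Hypothetical variance share of each component, sorted from largest to smallest.
variance_share = np.array([0.45, 0.25, 0.15, 0.05, 0.04, 0.03, 0.02, 0.01])
cumulative = np.cumsum(variance_share)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative share reaches the threshold."""
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

k_80 = components_needed(cumulative, 0.80)
k_95 = components_needed(cumulative, 0.95)
print(k_80, k_95)
```

Raising the threshold keeps more information but also more components, so the threshold is a trade-off rather than a single correct value.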

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('iris.csv')

In [4]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [5]:
X = df.drop('species', axis=1)
y = df['species']

In [6]:
from sklearn.decomposition import PCA

PCA is sensitive to the scale of the features, so we standardize the data before applying it.
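
Why scaling matters can be seen on synthetic data. The sketch below is an illustration under stated assumptions, not part of the original notebook: it builds two independent features on very different scales (the names `height_m` and `income` are made up) and compares the explained variance ratio with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features measured in very different units
height_m = rng.normal(1.7, 0.1, size=500)
income = rng.normal(50_000, 10_000, size=500)
X_demo = np.column_stack([height_m, income])

# Without scaling, the large-valued feature dominates the first component.
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# After standardization, both features contribute comparably.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print("raw:   ", ratio_raw)
print("scaled:", ratio_scaled)
```

On the raw data the first component captures almost all of the variance simply because `income` has the larger numeric range, not because it is more informative.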

In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,-0.900681,1.032057,-1.341272,-1.312977
1,-1.143017,-0.124958,-1.341272,-1.312977
2,-1.385353,0.337848,-1.398138,-1.312977
3,-1.506521,0.106445,-1.284407,-1.312977
4,-1.021849,1.263460,-1.341272,-1.312977
...,...,...,...,...
145,1.038005,-0.124958,0.819624,1.447956
146,0.553333,-1.281972,0.705893,0.922064
147,0.795669,-0.124958,0.819624,1.053537
148,0.432165,0.800654,0.933356,1.447956


In [12]:
pca = PCA(n_components=0.95)  # a float in (0, 1) is the fraction of variance to retain; an int is a fixed number of components

In [14]:
pca.fit(X_scaled)

parameter,value
n_components,0.95
copy,True
whiten,False
svd_solver,'auto'
tol,0.0
iterated_power,'auto'
n_oversamples,10
power_iteration_normalizer,'auto'
random_state,


In [15]:
X_pca = pca.transform(X_scaled)
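
After fitting, it can help to check how many components were kept and how much variance each one explains. The sketch below assumes scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for `iris.csv`, so the numbers may differ slightly from the notebook's file.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data
X_iris_scaled = StandardScaler().fit_transform(X_iris)

# Float: keep as many components as needed to explain at least 95% of the variance.
pca_fraction = PCA(n_components=0.95).fit(X_iris_scaled)
# Int: keep exactly this many components.
pca_fixed = PCA(n_components=3).fit(X_iris_scaled)

print(pca_fraction.n_components_)                         # components actually kept
print(np.cumsum(pca_fraction.explained_variance_ratio_))  # cumulative variance share
print(pca_fixed.n_components_)
```

The attributes `n_components_` and `explained_variance_ratio_` are set by `fit`, so they are only available on a fitted estimator.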

In [19]:
pca_names = [f'pca_{i}' for i in range(len(pca.components_))]
pca_names

['pca_0', 'pca_1']

In [20]:
X_pca = pd.DataFrame(X_pca, columns=pca_names)
X_pca

Unnamed: 0,pca_0,pca_1
0,-2.264542,0.505704
1,-2.086426,-0.655405
2,-2.367950,-0.318477
3,-2.304197,-0.575368
4,-2.388777,0.674767
...,...,...
145,1.870522,0.382822
146,1.558492,-0.905314
147,1.520845,0.266795
148,1.376391,1.016362
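
Since PCA is only a preprocessing step, its output is usually fed to another model. One way to sketch this, assuming scikit-learn's bundled iris data instead of `iris.csv` and using `SVC` (covered in the next section) as the downstream classifier, is a pipeline that scales, reduces, and classifies in one object:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling and PCA are refit on each training fold, so no test data leaks into them.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    SVC(kernel='linear'),
)

scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)
mean_accuracy = scores.mean()
print(f"mean CV accuracy on the reduced data: {mean_accuracy:.3f}")
```

Wrapping the steps in a pipeline is a design choice rather than a requirement; it mainly keeps the preprocessing consistent between training and prediction.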


# SVM

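An SVM (support vector machine) looks for a boundary that separates the instances of different classes with the largest possible margin: a line or hyperplane for a linear kernel, a curved surface for nonlinear kernels.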
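
A minimal sketch of the idea on hand-made, linearly separable points (the data and variable names are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two small, clearly separated groups of 2D points
X_toy = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_svm = SVC(kernel='linear', C=1.0).fit(X_toy, y_toy)

# Only the points closest to the boundary (the support vectors) define it.
print(toy_svm.support_vectors_)

toy_pred = toy_svm.predict([[0.5, 0.5], [3.5, 3.5]])
print(toy_pred)
```

Points far from the boundary do not influence the solution, which is why the fitted model exposes only its support vectors.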

In [27]:
from sklearn.svm import SVC
import numpy as np

In [22]:
from sklearn.model_selection import train_test_split, GridSearchCV

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

In [24]:
scaler = StandardScaler()

In [25]:
X_train = scaler.fit_transform(X_train)

In [26]:
X_test = scaler.transform(X_test)
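
The scaler is fit on the training split only and then reused on the test split, so that no information from the test data influences preprocessing. A small sketch of what that implies, assuming scikit-learn's bundled iris data and a fixed `random_state` (both choices are for reproducibility of the illustration only):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, stratify=y_iris, random_state=0
)

demo_scaler = StandardScaler().fit(X_tr)   # statistics come from the training data only
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # same statistics reused, no refitting

print(X_tr_scaled.mean(axis=0))  # approximately zero by construction
print(X_te_scaled.mean(axis=0))  # close to zero, but generally not exactly zero
```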

In [30]:
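params = {
    'C': [2**i for i in range(-3, 3)],  # 0.125, 0.25, ..., 4
    'kernel': ['linear', 'poly'],
    # gamma is ignored by the linear kernel; with gamma=0.0 (and coef0=0) the poly kernel is constant
    'gamma': np.arange(0, 1, 0.1)
}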
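
Because `gamma` has no effect on the linear kernel, a single dictionary evaluates many equivalent linear candidates. `GridSearchCV` also accepts a list of dictionaries, which can be used to give each kernel its own parameters. The sketch below only counts candidates with `ParameterGrid`; the split grid and its `gamma` range (0.1 to 0.9, dropping 0.0) are an illustrative alternative, not the grid used in this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = [2**i for i in range(-3, 3)]

# The grid as written in the notebook: every kernel is combined with every gamma.
original_grid = {
    'C': C_values,
    'kernel': ['linear', 'poly'],
    'gamma': np.arange(0, 1, 0.1),
}

# Alternative: one sub-grid per kernel, so gamma is only varied where it matters.
split_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['poly'], 'C': C_values, 'gamma': np.linspace(0.1, 0.9, 9)},
]

n_original = len(ParameterGrid(original_grid))
n_split = len(ParameterGrid(split_grid))
print(n_original, n_split)
```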

In [32]:
model = GridSearchCV(SVC(), param_grid=params, cv=5, scoring='accuracy', verbose=4)

In [33]:
model.fit(X_train, y_train)

Fitting 5 folds for each of 120 candidates, totalling 600 fits
[CV 1/5] END .C=0.125, gamma=0.0, kernel=linear;, score=1.000 total time=   0.0s
[CV 2/5] END .C=0.125, gamma=0.0, kernel=linear;, score=1.000 total time=   0.0s
[CV 3/5] END .C=0.125, gamma=0.0, kernel=linear;, score=0.952 total time=   0.0s
[CV 4/5] END .C=0.125, gamma=0.0, kernel=linear;, score=0.905 total time=   0.0s
[CV 5/5] END .C=0.125, gamma=0.0, kernel=linear;, score=1.000 total time=   0.0s
[CV 1/5] END ...C=0.125, gamma=0.0, kernel=poly;, score=0.333 total time=   0.0s
[CV 2/5] END ...C=0.125, gamma=0.0, kernel=poly;, score=0.333 total time=   0.0s
[CV 3/5] END ...C=0.125, gamma=0.0, kernel=poly;, score=0.333 total time=   0.0s
[CV 4/5] END ...C=0.125, gamma=0.0, kernel=poly;, score=0.333 total time=   0.0s
[CV 5/5] END ...C=0.125, gamma=0.0, kernel=poly;, score=0.333 total time=   0.0s
[CV 1/5] END .C=0.125, gamma=0.1, kernel=linear;, score=1.000 total time=   0.0s
[CV 2/5] END .C=0.125, gamma=0.1, kernel=linea

parameter,value
estimator,SVC()
param_grid,"{'C': [0.125, 0.25, ...], 'gamma': array([0. , 0....7, 0.8, 0.9]), 'kernel': ['linear', 'poly']}"
scoring,'accuracy'
n_jobs,
refit,True
cv,5
verbose,4
pre_dispatch,'2*n_jobs'
error_score,
return_train_score,False

parameter,value
C,0.25
kernel,'linear'
degree,3
gamma,np.float64(0.0)
coef0,0.0
shrinking,True
probability,False
tol,0.001
cache_size,200
class_weight,


In [34]:
print(model.best_estimator_)
print(model.best_estimator_.n_support_)
print(model.best_estimator_.support_)

SVC(C=0.25, gamma=np.float64(0.0), kernel='linear')
[ 3 17 14]
[ 16  78  91   5  11  14  15  19  29  30  36  40  42  44  50  72  80  83
  96  98  18  21  24  59  62  67  70  74  77  84  88  89  99 101]


In [35]:
# predict takes the feature matrix, not the labels; calling model.predict(y_test)
# raised "ValueError: could not convert string to float: 'setosa'"
y_test_pred = model.predict(X_test)
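
`predict` expects the feature matrix (`X_test`), and the true labels (`y_test`) are used only afterwards to score the predictions. A self-contained sketch of the full evaluation step follows, assuming scikit-learn's bundled iris data, a fixed `random_state`, and a reduced grid over `C` only; these are simplifications for illustration, not the notebook's exact setup.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, stratify=y_iris, random_state=0
)

demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr)
X_te = demo_scaler.transform(X_te)

search = GridSearchCV(
    SVC(kernel='linear'),
    param_grid={'C': [0.125, 0.25, 0.5, 1, 2, 4]},
    cv=5,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

y_te_pred = search.predict(X_te)                 # features in, predicted labels out
test_accuracy = accuracy_score(y_te, y_te_pred)  # labels are used only for scoring
print(f"test accuracy: {test_accuracy:.3f}")
```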