### PCA - Principal Component Analysis  
--------------------

Dataset: Online News Popularity  
Data on published news articles and the number of times they were shared on social networks  
Target variable: "shares" - number of shares

**Contents**  
[1. Imports](#1.-Imports)  
[2. Data loading](#2.-Data-loading)  
[3. Data exploration](#3.-Data-exploration)  
[4. Data preparation](#4.-Data-preparation)  
[5.1. PCA including categories and numerical data](#5.1.-PCA-including-categories-and-numerical-data)  
[5.2. PCA including only numerical data](#5.2.-PCA-including-only-numerical-data)  
[6. Model evaluation](#6.-Model-evaluation)  
[References](#References)

#### 1. Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import decomposition
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [2]:
# Function for model accuracy evaluation

def model_score(X, y):
    # Split the data into train and test parts
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    # Model training
    model = RandomForestClassifier(max_depth=2, random_state=1)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    print('accuracy:', round(score*100, 1), '%')

#### 2. Data loading

In [3]:
data = pd.read_csv('data/OnlineNewsPopularity.zip', compression='zip')
data.head()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505


#### 3. Data exploration

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39644 entries, 0 to 39643
Data columns (total 61 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   url                             39644 non-null  object 
 1    timedelta                      39644 non-null  float64
 2    n_tokens_title                 39644 non-null  float64
 3    n_tokens_content               39644 non-null  float64
 4    n_unique_tokens                39644 non-null  float64
 5    n_non_stop_words               39644 non-null  float64
 6    n_non_stop_unique_tokens       39644 non-null  float64
 7    num_hrefs                      39644 non-null  float64
 8    num_self_hrefs                 39644 non-null  float64
 9    num_imgs                       39644 non-null  float64
 10   num_videos                     39644 non-null  float64
 11   average_token_length           39644 non-null  float64
 12   num_keywords                   

Every column name except `url` has a leading space.  
The `url` column carries no predictive information, so we drop it.

In [5]:
# removing spaces in column names
data.columns = data.columns.str.strip()

# drop column url
data = data.drop(columns='url')
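
Stray whitespace in column names is easy to miss in `data.info()` output. A minimal sketch, using a small hypothetical CSV string rather than the real dataset, shows how printing the names as a list exposes the leading spaces and how `str.strip()` removes them:

```python
import io
import pandas as pd

# Hypothetical CSV: every header name except the first has a leading space
raw = (
    "url, timedelta, shares\n"
    "http://example.com/a, 731.0, 593\n"
    "http://example.com/b, 730.0, 711\n"
)
toy = pd.read_csv(io.StringIO(raw))

before = list(toy.columns)             # names as parsed, leading spaces included
toy.columns = toy.columns.str.strip()  # same call as in the cell above
after = list(toy.columns)              # cleaned names

print(before)
print(after)
```

Printing a list shows each name in quotes, so the extra space is visible at a glance.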

We will classify news as popular or unpopular using a threshold of 1400 shares (the median).

In [6]:
# Create 2 classes
data['popularity']=data['shares'].apply(lambda x: 1 if x>=1400 else 0)

In [7]:
data[['shares', 'popularity']].head()

Unnamed: 0,shares,popularity
0,593,0
1,711,0
2,1500,1
3,1200,0
4,505,0
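
The threshold could also be computed from the data instead of being hard-coded. The sketch below uses a short hypothetical list of share counts (not the real dataset) and a vectorized comparison in place of the `lambda`; the names `threshold` and `popularity` are illustrative only:

```python
import pandas as pd

# Hypothetical share counts, not taken from the dataset
shares = pd.Series([593, 711, 1500, 1200, 505, 2100, 1400])

threshold = shares.median()                     # middle value of the sorted counts
popularity = (shares >= threshold).astype(int)  # vectorized equivalent of the lambda

print(threshold)
print(popularity.tolist())
```

Because the comparison is `>=`, articles exactly at the median fall into the popular class, so the two classes are close to balanced but not necessarily equal in size.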


Since `popularity` is derived directly from `shares`, keeping `shares` among the features would leak the target, so we drop it.

In [8]:
data = data.drop(columns="shares")

Although almost all variables have type float, some of them are binary features with values 0/1.

In [9]:
category = []
for column in data.columns[:-1]:   # except column 'popularity'
    if data[column].nunique() < 5:
        print(column, data[column].unique())
        category.append(column)    # saving categorical names to list

data_channel_is_lifestyle [0. 1.]
data_channel_is_entertainment [1. 0.]
data_channel_is_bus [0. 1.]
data_channel_is_socmed [0. 1.]
data_channel_is_tech [0. 1.]
data_channel_is_world [0. 1.]
weekday_is_monday [1. 0.]
weekday_is_tuesday [0. 1.]
weekday_is_wednesday [0. 1.]
weekday_is_thursday [0. 1.]
weekday_is_friday [0. 1.]
weekday_is_saturday [0. 1.]
weekday_is_sunday [0. 1.]
is_weekend [0. 1.]
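
The `nunique() < 5` rule works here because every low-cardinality column happens to be a 0/1 flag. As a hedged alternative, the sketch below checks the value set explicitly; the toy frame and its column values are hypothetical:

```python
import pandas as pd

# Hypothetical frame: one binary flag, one low-cardinality count, one continuous feature
toy = pd.DataFrame({
    "is_weekend": [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    "num_videos": [0.0, 1.0, 2.0, 1.0, 0.0, 2.0],
    "avg_len": [4.1, 4.7, 5.0, 4.4, 4.9, 4.3],
})

# Strict test: every observed value must be 0 or 1
binary_cols = [c for c in toy.columns if set(toy[c].dropna().unique()) <= {0, 1}]

# Rule used in the notebook: fewer than 5 distinct values
low_card_cols = [c for c in toy.columns if toy[c].nunique() < 5]

print(binary_cols)
print(low_card_cols)
```

On this toy frame the cardinality rule also flags `num_videos`, while the value-set check keeps only the true 0/1 column.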


We will compare two approaches: applying PCA to all features, including the binary ones, versus applying PCA to the numerical features only.  

#### 4. Data preparation

In [10]:
# Splitting the data into X and y (target)

X = data.copy()
y = X.pop('popularity')   # pop returns the column and removes it from the DataFrame
y

0        0
1        0
2        1
3        0
4        0
        ..
39639    1
39640    1
39641    1
39642    0
39643    0
Name: popularity, Length: 39644, dtype: int64

#### 5.1. PCA including categories and numerical data

In [11]:
# Standardization of X

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X_std

array([[ 1.75788035,  0.75744723, -0.69521045, ..., -0.97543219,
        -1.81071884,  0.13891975],
       [ 1.75788035, -0.66165665, -0.61879381, ..., -0.26907618,
         0.83774863, -0.68965812],
       [ 1.75788035, -0.66165665, -0.71219192, ..., -0.26907618,
         0.83774863, -0.68965812],
       ...,
       [-1.61808342, -0.18862202, -0.2218518 , ...,  0.24463729,
        -1.56994907, -0.08705603],
       [-1.61808342, -2.08076053,  0.28759248, ..., -0.26907618,
         0.83774863, -0.68965812],
       [-1.61808342, -0.18862202, -0.82681689, ...,  0.67273184,
        -0.92789635,  0.41511238]])
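
Standardization rescales each column to zero mean and unit variance, $z = (x - \mu) / \sigma$. A small check on a hypothetical 4x2 matrix confirms that `StandardScaler` matches the manual formula when the population standard deviation (`ddof=0`) is used:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 4x2 feature matrix
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 60.0]])

# Manual z-score with the population standard deviation (ddof=0)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=0)
scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

PCA is sensitive to feature scale, which is why this step precedes the decomposition.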

In [12]:
# keep as many components as needed to explain more than 90% of the variance

pca = decomposition.PCA(n_components=0.9)
X_pca = pca.fit_transform(X_std)

In [13]:
print('number of original features', X.shape[1])
print('number of new components', pca.n_components_)

number of original features 59
number of new components 31


In [14]:
# share of the total variance explained by each new component
pca.explained_variance_ratio_

array([0.08275377, 0.06985565, 0.06110151, 0.0508775 , 0.0476545 ,
       0.04397503, 0.04309095, 0.03886672, 0.03606727, 0.03525317,
       0.03393574, 0.03195266, 0.02806568, 0.02339546, 0.02313933,
       0.02094133, 0.02082309, 0.02053416, 0.02022005, 0.01987701,
       0.01914187, 0.01839669, 0.0177176 , 0.01543416, 0.01476361,
       0.01378702, 0.01224799, 0.01147547, 0.01112622, 0.01035957,
       0.00973298])

In [15]:
# fraction of variance retained after dimensionality reduction
sum(pca.explained_variance_ratio_)

0.9065637591362583
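
A fractional `n_components` asks PCA to keep the smallest number of leading components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces the selection by hand on synthetic data; the data, the variable names, and the explicit `svd_solver="full"` are assumptions made for the illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical correlated data: 10 features driven by 4 latent factors plus noise
latent = rng.normal(size=(500, 4))
X_toy = latent @ rng.normal(size=(4, 10)) + 0.3 * rng.normal(size=(500, 10))

# Fit all components, then find the smallest k whose cumulative ratio exceeds 0.9
full = PCA(svd_solver="full").fit(X_toy)
cumulative = np.cumsum(full.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

# Let scikit-learn choose the number of components for the same target
auto = PCA(n_components=0.9, svd_solver="full").fit(X_toy)

print(k, auto.n_components_)
print(cumulative[:k].round(3))
```

This is why the retained variance above is slightly more than 0.9 rather than exactly 0.9: the count of components is an integer, so the cumulative ratio overshoots the target.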

#### 5.2. PCA including only numerical data

Split the features X into two dataframes: one with the categorical variables and one with the numerical variables.

In [16]:
X_num = X.drop(columns=category)
X_cat = X[category]
print(X_num.shape)
print(X_cat.shape)

(39644, 45)
(39644, 14)


In [17]:
scaler_num, scaler_cat = StandardScaler(), StandardScaler()
Xnum_std = scaler_num.fit_transform(X_num)
Xcat_std = scaler_cat.fit_transform(X_cat)

In [18]:
pca_num = decomposition.PCA(n_components=0.9)
Xnum_pca = pca_num.fit_transform(Xnum_std)

In [20]:
# fraction of variance retained after dimensionality reduction
sum(pca_num.explained_variance_ratio_)

0.908321934327797

In [21]:
# Join reduced data with categorical data
X_join = np.hstack((Xnum_pca, Xcat_std))
X_join.shape

(39644, 38)

Applying PCA to the numerical features only and appending the 14 binary features gives 38 columns (24 principal components plus 14 binary features) instead of the original 59. Applying PCA to all features together gives 31 components.
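
The split, scale, reduce, and join steps can also be expressed with scikit-learn's `ColumnTransformer`. This is a sketch on a hypothetical toy frame, not the notebook's method; the column names `num_*`, `flag_a`, and `flag_b` are invented for the example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical frame: 6 correlated numerical columns and 2 binary flags
base = rng.normal(size=(200, 2))
num = base @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
toy = pd.DataFrame(num, columns=[f"num_{i}" for i in range(6)])
toy["flag_a"] = rng.integers(0, 2, size=200)
toy["flag_b"] = rng.integers(0, 2, size=200)

num_cols = [c for c in toy.columns if c.startswith("num_")]
flag_cols = ["flag_a", "flag_b"]

# PCA runs only on the scaled numerical block; the flags are scaled and kept as they are
transformer = ColumnTransformer([
    ("num", Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=0.9))]), num_cols),
    ("flags", StandardScaler(), flag_cols),
])
joined = transformer.fit_transform(toy)

n_pca = transformer.named_transformers_["num"].named_steps["pca"].n_components_
print(joined.shape, n_pca)
```

The output has one column per retained principal component plus one per binary flag, mirroring the `np.hstack` result above.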

#### 6. Model evaluation

According to [1], **Random Forest** is the best-performing classification model for this dataset.

In [30]:
# After PCA including categories and numerical data
model_score(X_pca, y)

accuracy: 60.3 %


In [28]:
# After PCA including only numerical data
model_score(X_join, y)

accuracy: 60.5 %
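
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below scores a majority-class predictor on hypothetical labels with a roughly even class split, which is what a median threshold tends to produce; the data and the 0.53 class proportion are assumptions, not values from the dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Hypothetical features and labels; the class proportion is an assumption
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.53).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=1)

# Always predicts the most frequent training class and ignores the features
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print("baseline accuracy:", round(baseline_acc * 100, 1), "%")
```

Running the same baseline on the real `y` would show how much of the reported accuracy comes from the model rather than from the class balance.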


The accuracies are nearly identical (60.3 % vs. 60.5 %), so applying PCA only to the numerical features gives no significant advantage over applying it to all features.  
We can therefore apply PCA to all features, which also yields fewer components (31 vs. 38).  
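
In this notebook the scaler and PCA are fitted on the full dataset before the train/test split, and each variant is scored on a single split. One possible refinement, sketched here on synthetic data from `make_classification` (a stand-in, not the news dataset), is to wrap all steps in a pipeline and cross-validate, so that the preprocessing is refitted on the training part of each fold:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the news data (hypothetical)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=1
)

# Scaler and PCA are refitted on the training part of every fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.9),
    RandomForestClassifier(max_depth=2, random_state=1),
)
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy")

print(scores.round(3))
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of whether a 0.2-point gap between two variants is larger than the split-to-split noise.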


#### References  
  
1. K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
    Support System for Predicting the Popularity of Online News. Proceedings
    of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence,
    September, Coimbra, Portugal.

About PCA  
- https://habr.com/ru/post/507618/  
- https://habr.com/ru/post/304214/  
- http://mathprofi.ru/sobstvennye_znachenija_i_sobstvennye_vektory.html  
- https://en.wikipedia.org/wiki/Orthogonal_matrix