# DV_FAC
* Factor analysis -> average the items within each construct : online_media_pp_mean / voting_mean / offline_media_pp_mean / campaign_worker_pp_mean
* Factor analysis -> use the factor score of each construct : online_media_pp_score / voting_score / offline_media_pp_score / campaign_worker_pp_score
* Split into online_pp / offline_pp first -> factor analysis -> build two composite scores from the factor analysis results : onlnie_scores / offlnie_scores


## Import package

In [2]:
! pip install factor_analyzer pingouin stargazer plotnine geopandas folium Factor_Analysis

In [3]:
# for data ETL
import pandas as pd
import ETL
# Module for spatial data manipulation
import geopandas as gpd
# Module for data viz 
from plotnine import *
import plotnine
import plotly.express as px
import seaborn as sns               
import matplotlib.pyplot as plt
# Module for spatial data viz
import folium
import branca
import branca.colormap as cm
from folium.features import GeoJson, GeoJsonTooltip
# Modules for building the index (factor analysis)
import Factor_Analysis 
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

## Load data and filter

In [4]:
raw_data = pd.read_csv('raw_data.csv')

In [5]:
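# Keep only respondents with no party preference; .copy() avoids SettingWithCopyWarning on the later drop
filter_data = raw_data.loc[raw_data['目前國內政黨當中，請問您是否偏向哪一個政黨？'].isin(['沒有特定支持', '都不支持'])].copy()
# Drop the consent and contact-email columns, which are not used in the analysis
filter_data = filter_data.drop(columns=['您有絶對的權力決定是否要參與本研究。若您願意參與，請務必勾選下列選項：', '請填寫您的電子信箱，以利後續抽獎聯繫使用'])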

## Data Cleaning

In [6]:
ml_df = ETL.data_cleaning(filter_data)
ml_df = ml_df.reset_index(drop=True)

In [7]:
ml_df.shape

In [8]:
ml_df = ETL.DV_feature(ml_df)
ml_df.shape

## Factor Analysis

### Select the DV columns

In [None]:
DV_vars = ['TV_news_time', 'news_paper_time', 'int_news_time', 'TV_debate', 'read_media', 'like_media', 'share_media', 'comment_media', 'int_discuss', 'read_election_news', 'read_election_leaflet', 'convince', 'campaign', 'volunteer', 'election_mayor', 'election_18']
DV_df = ml_df[DV_vars]

### Bartlett’s test and Kaiser-Meyer-Olkin 

In [None]:
# Adequacy test - Bartlett's test
chiSquareValue, pValue = calculate_bartlett_sphericity(DV_df)
print('Chi-square value : {}'.format(round(chiSquareValue, ndigits = 3)))
print('p-value          : {}'.format(round(pValue, ndigits = 3)))

The Bartlett test produces a p-value below 0.05, so we reject the null hypothesis that the correlation matrix is an identity matrix. In other words, the variables are correlated with one another, which is a precondition for factor analysis.
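
To make the null hypothesis concrete, here is a minimal sketch of the chi-square statistic behind Bartlett's test of sphericity, computed directly from a correlation matrix. The helper name `bartlett_sphericity_statistic` and the 2-variable example values are illustrative assumptions, not part of this notebook; the notebook itself relies on `calculate_bartlett_sphericity`.

```python
import numpy as np

def bartlett_sphericity_statistic(corr, n_obs):
    """Chi-square statistic and degrees of freedom for Bartlett's test of sphericity.

    corr  : (p, p) correlation matrix of the observed variables
    n_obs : number of observations used to estimate corr
    """
    corr = np.asarray(corr, dtype=float)
    p = corr.shape[0]
    # slogdet is numerically safer than log(det(...)) for near-singular matrices
    _, log_det = np.linalg.slogdet(corr)
    chi_square = -(n_obs - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) // 2
    return chi_square, dof

# Hypothetical example: two variables correlated at 0.5, estimated from 100 respondents
chi_square, dof = bartlett_sphericity_statistic([[1.0, 0.5], [0.5, 1.0]], n_obs=100)
print(round(chi_square, 3), dof)
```

An identity correlation matrix has a log-determinant of zero, so the statistic is zero. The stronger the correlations, the smaller the determinant and the larger the statistic.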

In [None]:
# Adequacy test - Kaiser-Meyer-Olkin test
# calculate_kmo returns the per-variable KMO values and the overall KMO value
kmo_per_variable, kmo_total = calculate_kmo(DV_df)
print('KMO value : {}'.format(round(kmo_total, ndigits = 3)))

### Communality testing


In [None]:
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer(n_factors = 25, rotation = None)
fa.fit(DV_df)
# The communalities
df_communalities_DV = pd.DataFrame(data = {'Column': DV_df.columns, 'Communality': fa.get_communalities()})
df_communalities_DV
df_communalities_DV.style.apply(Factor_Analysis.highlightCommunalities, subset = ['Communality'])

Yellow highlighting marks communality values that meet the criterion (greater than 0.5). Variables with a communality below 0.5 are eliminated.
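
As a rough illustration of the 0.5 criterion, the sketch below computes each communality as the sum of a variable's squared loadings and then splits the variables into kept and dropped. The loading values and the names `var_a`, `var_b`, `var_c` are made up for the example and are not results from this dataset.

```python
import numpy as np

def communalities_from_loadings(loadings):
    """Communality of each variable = sum of its squared loadings across factors."""
    loadings = np.asarray(loadings, dtype=float)
    return (loadings ** 2).sum(axis=1)

# Hypothetical loadings: three variables (rows) on two factors (columns)
names = ['var_a', 'var_b', 'var_c']
loadings = [[0.8, 0.1],
            [0.6, 0.5],
            [0.3, 0.2]]

communalities = communalities_from_loadings(loadings)
threshold = 0.5
kept = [n for n, c in zip(names, communalities) if c > threshold]
dropped = [n for n, c in zip(names, communalities) if c <= threshold]
print(kept, dropped)
```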

In [None]:
# Data viz
plotnine.options.figure_size = (15, 9)
communality_bar = (
    ggplot(data = df_communalities_DV)+
    geom_bar(aes(x = 'Column',
                 y = 'Communality'),
             width = 0.75,
             stat = 'identity')+
    geom_hline(yintercept = 0.5)+
    scale_x_discrete(limits = df_communalities_DV['Column'].tolist())+
    labs(title = 'Communalities of factor analysis')+
    xlab('Columns')+
    ylab('Communalities')+
    theme_minimal()
)
# Display the viz
communality_bar

### Reduce Factors

In [None]:
# Check Eigenvalues
eigenValue, value = fa.get_eigenvalues()

In [None]:
# Convert the results into a dataframe
df_DV_eigen = pd.DataFrame({'Factor': range(1, len(eigenValue) + 1), 'Eigen value': eigenValue})
df_DV_eigen.style.apply(Factor_Analysis.highlightEigenvalue, subset = ['Eigen value'])

According to the Kaiser criterion, factors with an eigenvalue greater than 1 are retained. The 16 observed variables are then grouped and interpreted through those factors; the rotated model below retains 4.
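
The Kaiser rule itself can be sketched in a few lines: take the eigenvalues of the correlation matrix and count how many exceed 1. The helper `kaiser_factor_count` and the block-structured correlation matrix are assumptions for illustration only; in this notebook the eigenvalues come from `fa.get_eigenvalues()`.

```python
import numpy as np

def kaiser_factor_count(corr):
    """Number of eigenvalues of the correlation matrix that are greater than 1."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))
    return int((eigenvalues > 1).sum())

# Hypothetical correlation matrix: two pairs of strongly related variables
corr = [[1.0, 0.8, 0.0, 0.0],
        [0.8, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.8],
        [0.0, 0.0, 0.8, 1.0]]

n_factors = kaiser_factor_count(corr)
print(n_factors)
```

Each correlated pair contributes one eigenvalue of 1.8 and one of 0.2, so two factors pass the threshold in this toy case.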

In [None]:
# Data viz
plotnine.options.figure_size = (8, 4.8)
DV_scree_eigenvalue = (
    ggplot(data = df_DV_eigen)+
    geom_hline(yintercept = 1)+
    geom_line(aes(x = 'Factor',
                  y = 'Eigen value'))+
    geom_point(aes(x = 'Factor',
                   y = 'Eigen value'),
               size = 2)+
    labs(title = 'Scree plot of eigenvalues from factor analysis')+
    xlab('Factors')+
    ylab('Eigenvalue')+
    theme_minimal()
)
# Display the viz
DV_scree_eigenvalue

In [None]:
# Factor analysis with rotation
fa = FactorAnalyzer(n_factors = 4, rotation = 'varimax')
fa.fit(DV_df)
# Create the factor names
facs = ['Factors' + ' ' + str(i + 1) for i in range(4)]
print(facs)
# Loading factors
pd.DataFrame(data = fa.loadings_, index = DV_df.columns, columns = facs).style.apply(Factor_Analysis.highlightLoadings)

#### The results above show that the dependent variables fall into 4 constructs:

* Online media political participation (online_media_pp) : read_media / like_media / share_media / comment_media

-> reference : 劉嘉薇, 2019

* Voting participation (voting) : election_mayor / election_18

-> reference : Barnes and Kaase, 1979

* Offline media political participation (offline_media_pp) : read_election_news / read_election_leaflet

-> reference : 徐火炎, 2001

* Campaign-work political participation (campaign_worker_pp) : campaign / volunteer

-> reference : Milbrath and Goel, 1977
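
One simple way to read constructs off a rotated loading matrix is to assign each variable to the factor on which its absolute loading is largest. The sketch below does that; it is a swapped-in illustration, not the rule used by `Factor_Analysis.highlightLoadings` (whose threshold is not shown here), and the loading values are invented for the example.

```python
import numpy as np

def group_by_dominant_factor(variable_names, loadings, factor_names):
    """Assign each variable to the factor with its largest absolute loading."""
    loadings = np.asarray(loadings, dtype=float)
    dominant = np.abs(loadings).argmax(axis=1)
    groups = {name: [] for name in factor_names}
    for var, idx in zip(variable_names, dominant):
        groups[factor_names[idx]].append(var)
    return groups

# Invented loadings for four of the DV columns on two factors (illustration only)
variables = ['read_media', 'like_media', 'election_mayor', 'election_18']
loadings = [[0.75, 0.10],
            [0.80, 0.05],
            [0.12, 0.70],
            [0.08, -0.65]]
facs_demo = ['Factors 1', 'Factors 2']

groups = group_by_dominant_factor(variables, loadings, facs_demo)
print(groups)
```

Using the absolute value means a variable with a strong negative loading is still grouped with that factor.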

In [None]:
# Explained variance
idx = ['SS Loadings', 'Proportion Variance', 'Cumulative Variance']
df_variance = pd.DataFrame(data = fa.get_factor_variance(), index = idx, columns = facs)
# Ratio of variance
ratioVariance = fa.get_factor_variance()[1] / fa.get_factor_variance()[1].sum()
df_ratio_var = pd.DataFrame(data = ratioVariance.reshape((1, 4)), index = ['Ratio Variance'], columns = facs)
# New completed dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pd.concat([df_variance, df_ratio_var])

In [None]:
df_factors = pd.DataFrame(data = fa.fit_transform(DV_df),columns = facs)
df_factors                     

### Establish Index with mean method after FAC

In [None]:
# online_media_pp_vars = ['read_media', 'like_media', 'share_media', 'comment_media']
# voting_vars = ['election_mayor', 'election_18']
# offline_media_pp_vars = ['read_election_news', 'read_election_leaflet']
# campaign_worker_pp_vars = ['campaign', 'volunteer']

In [None]:
# # Calculate the scores for each factor
# ml_df['online_media_pp_mean'] = ml_df[online_media_pp_vars].mean(axis=1)
# ml_df['voting_mean'] = ml_df[voting_vars].mean(axis=1)
# ml_df['offline_media_pp_mean'] = ml_df[offline_media_pp_vars].mean(axis=1)
# ml_df['campaign_worker_pp_mean'] = ml_df[campaign_worker_pp_vars].mean(axis=1)

### Establish Index with Factor Scores after FAC

In [None]:
# df_factors.rename(columns = {'Factors 1': 'online_media_pp_score', 
#                              'Factors 2': 'voting_score',
#                              'Factors 3': 'offline_media_pp_score',
#                              'Factors 4': 'campaign_worker_pp_score'}, inplace = True)
# ml_df = ml_df.join(df_factors)

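### Establish Index with Total Scores : group items into online_pp and offline_pp before FAC
Each factor score is weighted by that factor's Proportion Variance, and the weighted sum is divided by the Cumulative Variance (the sum of those proportions). With more than two factors, add one term per factor:

Total_Scores = (factor1_value * factor1_Proportion + factor2_value * factor2_Proportion) / Cumulative Variance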
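
The weighting above can be sketched with plain NumPy for any number of factors. The function name `total_scores` and the numbers are assumptions for illustration; in the notebook the factor scores would come from `fa.fit_transform(...)` and the proportions from `fa.get_factor_variance()[1]`.

```python
import numpy as np

def total_scores(factor_scores, proportion_variance):
    """Variance-weighted composite: sum_k(score_k * proportion_k) / sum_k(proportion_k)."""
    factor_scores = np.asarray(factor_scores, dtype=float)   # shape (n_samples, n_factors)
    weights = np.asarray(proportion_variance, dtype=float)   # shape (n_factors,)
    return factor_scores @ weights / weights.sum()

# Hypothetical factor scores for two respondents on two factors,
# with hypothetical proportion-variance weights 0.3 and 0.1
scores = total_scores([[1.0, -1.0],
                       [2.0, 0.0]],
                      [0.3, 0.1])
print(scores)
```

Dividing by the sum of the proportions rescales the weights so they add up to 1, which keeps the composite on the same scale as the individual factor scores.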

In [None]:
online_pp_vars = ['TV_news_time', 'news_paper_time', 'int_news_time', 'TV_debate', 'read_media', 'like_media', 'share_media', 'comment_media', 'int_discuss']
offline_pp_vars = ['read_election_news', 'read_election_leaflet', 'convince', 'campaign', 'volunteer', 'election_mayor', 'election_18']
online_pp_df = ml_df[online_pp_vars]
offline_pp_df = ml_df[offline_pp_vars]

#### Online PP

In [None]:
print('-' * 20 + " online_pp " + '-' * 20)
print('\n')
print("* Bartlett's test")
chiSquareValue, pValue = calculate_bartlett_sphericity(online_pp_df)
print('Chi-square value : {}'.format(round(chiSquareValue, ndigits = 3)))
print('p-value          : {}'.format(round(pValue, ndigits = 3)))
print('\n')
print('* Kaiser-Meyer-Olkin')
# calculate_kmo returns the per-variable KMO values and the overall KMO value
kmo_per_variable, kmo_total = calculate_kmo(online_pp_df)
print('KMO value : {}'.format(round(kmo_total, ndigits = 3)))

The Bartlett test produces a p-value below 0.05, so we reject the null hypothesis that the correlation matrix is an identity matrix. In other words, the variables are correlated with one another, which is a precondition for factor analysis.

In [None]:
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer(n_factors = 25, rotation = None)
fa.fit(online_pp_df)
# The communalities
df_communalities_online_pp = pd.DataFrame(data = {'Column': online_pp_df.columns, 'Communality': fa.get_communalities()})
df_communalities_online_pp
df_communalities_online_pp.style.apply(Factor_Analysis.highlightCommunalities, subset = ['Communality'])

In [None]:
# Data viz
plotnine.options.figure_size = (15, 9)
communality_bar = (
    ggplot(data = df_communalities_online_pp)+
    geom_bar(aes(x = 'Column',
                 y = 'Communality'),
             width = 0.75,
             stat = 'identity')+
    geom_hline(yintercept = 0.5)+
    scale_x_discrete(limits = df_communalities_online_pp['Column'].tolist())+
    labs(title = 'Communalities of factor analysis')+
    xlab('Columns')+
    ylab('Communalities')+
    theme_minimal()
)
# Display the viz
communality_bar

In [None]:
# Check Eigenvalues
eigenValue, value = fa.get_eigenvalues()

In [None]:
# Convert the results into a dataframe
df_online_eigen = pd.DataFrame({'Factor': range(1, len(eigenValue) + 1), 'Eigen value': eigenValue})
df_online_eigen.style.apply(Factor_Analysis.highlightEigenvalue, subset = ['Eigen value'])

In [None]:
# Data viz
plotnine.options.figure_size = (8, 4.8)
online_scree_eigenvalue = (
    ggplot(data = df_online_eigen)+
    geom_hline(yintercept = 1)+
    geom_line(aes(x = 'Factor',
                  y = 'Eigen value'))+
    geom_point(aes(x = 'Factor',
                   y = 'Eigen value'),
               size = 2)+
    labs(title = 'Scree plot of eigenvalues from factor analysis')+
    xlab('Factors')+
    ylab('Eigenvalue')+
    theme_minimal()
)
# Display the viz
online_scree_eigenvalue

In [None]:
# Factor analysis with rotation
fa = FactorAnalyzer(n_factors = 3, rotation = 'varimax')
fa.fit(online_pp_df)
# Create the factor names
facs = ['Factors' + ' ' + str(i + 1) for i in range(3)]
print(facs)
# Loading factors
pd.DataFrame(data = fa.loadings_, index = online_pp_df.columns, columns = facs).style.apply(Factor_Analysis.highlightLoadings)

In [None]:
# Explained variance
idx = ['SS Loadings', 'Proportion Variance', 'Cumulative Variance']
df_variance = pd.DataFrame(data = fa.get_factor_variance(), index = idx, columns = facs)
# Ratio of variance
ratioVariance = fa.get_factor_variance()[1] / fa.get_factor_variance()[1].sum()
df_ratio_var = pd.DataFrame(data = ratioVariance.reshape((1, 3)), index = ['Ratio Variance'], columns = facs)
# New completed dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pd.concat([df_variance, df_ratio_var])

In [None]:
df_factors_onlnie = pd.DataFrame(data = fa.fit_transform(online_pp_df),columns = facs)
df_factors_onlnie                     

In [None]:
# ml_df['onlnie_scores'] = (df_factors_onlnie['Factors 1'] * fa.get_factor_variance()[1][0] + df_factors_onlnie['Factors 2'] * fa.get_factor_variance()[1][1] + df_factors_onlnie['Factors 3'] * fa.get_factor_variance()[1][2]) / fa.get_factor_variance()[1].sum()

#### Offline PP

In [None]:
print('-' * 20 + " offline_pp " + '-' * 20)
print('\n')
print("* Bartlett's test")
chiSquareValue, pValue = calculate_bartlett_sphericity(offline_pp_df)
print('Chi-square value : {}'.format(round(chiSquareValue, ndigits = 3)))
print('p-value          : {}'.format(round(pValue, ndigits = 3)))
print('\n')
print('* Kaiser-Meyer-Olkin')
# calculate_kmo returns the per-variable KMO values and the overall KMO value
kmo_per_variable, kmo_total = calculate_kmo(offline_pp_df)
print('KMO value : {}'.format(round(kmo_total, ndigits = 3)))

In [None]:
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer(n_factors = 25, rotation = None)
fa.fit(offline_pp_df)
# The communalities
df_communalities_offline_pp = pd.DataFrame(data = {'Column': offline_pp_df.columns, 'Communality': fa.get_communalities()})
df_communalities_offline_pp
df_communalities_offline_pp.style.apply(Factor_Analysis.highlightCommunalities, subset = ['Communality'])

In [None]:
# Data viz
plotnine.options.figure_size = (15, 9)
communality_bar = (
    ggplot(data = df_communalities_offline_pp)+
    geom_bar(aes(x = 'Column',
                 y = 'Communality'),
             width = 0.75,
             stat = 'identity')+
    geom_hline(yintercept = 0.5)+
    scale_x_discrete(limits = df_communalities_offline_pp['Column'].tolist())+
    labs(title = 'Communalities of factor analysis')+
    xlab('Columns')+
    ylab('Communalities')+
    theme_minimal()
)
# Display the viz
communality_bar

In [None]:
# Check Eigenvalues
eigenValue, value = fa.get_eigenvalues()

In [None]:
# Convert the results into a dataframe
df_offline_eigen = pd.DataFrame({'Factor': range(1, len(eigenValue) + 1), 'Eigen value': eigenValue})
df_offline_eigen.style.apply(Factor_Analysis.highlightEigenvalue, subset = ['Eigen value'])

In [None]:
# Data viz
plotnine.options.figure_size = (8, 4.8)
offline_scree_eigenvalue = (
    ggplot(data = df_offline_eigen)+
    geom_hline(yintercept = 1)+
    geom_line(aes(x = 'Factor',
                  y = 'Eigen value'))+
    geom_point(aes(x = 'Factor',
                   y = 'Eigen value'),
               size = 2)+
    labs(title = 'Scree plot of eigenvalues from factor analysis')+
    xlab('Factors')+
    ylab('Eigenvalue')+
    theme_minimal()
)
# Display the viz
offline_scree_eigenvalue

In [None]:
# Factor analysis with rotation
fa = FactorAnalyzer(n_factors = 2, rotation = 'varimax')
fa.fit(offline_pp_df)
# Create the factor names
facs = ['Factors' + ' ' + str(i + 1) for i in range(2)]
print(facs)
# Loading factors
pd.DataFrame(data = fa.loadings_, index = offline_pp_df.columns, columns = facs).style.apply(Factor_Analysis.highlightLoadings)

In [None]:
# Explained variance
idx = ['SS Loadings', 'Proportion Variance', 'Cumulative Variance']
df_variance = pd.DataFrame(data = fa.get_factor_variance(), index = idx, columns = facs)
# Ratio of variance
ratioVariance = fa.get_factor_variance()[1] / fa.get_factor_variance()[1].sum()
df_ratio_var = pd.DataFrame(data = ratioVariance.reshape((1, 2)), index = ['Ratio Variance'], columns = facs)
# New completed dataframe (DataFrame.append was removed in pandas 2.0, so use pd.concat)
pd.concat([df_variance, df_ratio_var])

In [None]:
# df_factors_offlnie = pd.DataFrame(data = fa.fit_transform(offline_pp_df),columns = facs)
# df_factors_offlnie                     

In [None]:
# ml_df['offlnie_scores'] = (df_factors_offlnie['Factors 1'] * fa.get_factor_variance()[1][0] + df_factors_offlnie['Factors 2'] * fa.get_factor_variance()[1][1]) / fa.get_factor_variance()[1].sum()