CBS International Business School
Leveraging Big Data Analytics for Enhanced Financial Inclusion in Nigeria: A Comprehensive Analysis of Strategies, Impact, And Challenges in Developing Economies.
Master of Arts in International Business
Digital Transformation Management
Prof. Dr. Geoffrey Writes
- Name: Kosisochukwu Cypran Akubude
- Student Number: 1195501031
- Date: April 14, 2024
- Summary
- Phase 1: Ask
- Phase 2: Prepare
- 3.1. Data Location
- 3.2. Data Organization
- 3.3. Data Credibility
- 3.4. Data License and Privacy
- Phase 3: Process
- Phase 4: Analyze
- 5.1. Regression
- 5.2. Pearson Correlation Test
- 5.3. Test for Multicollinearity
- 5.4. Descriptive Statistics
- 5.5. Hypothesis Testing
- Phase 5: Share
- Phase 6: Conclusion (Act)
The purpose of this study is to analyze the relationship between financial inclusion and big data, including economic data variables. We aim towards analysing the datasets and discuss findings. The following tests will be conducted in this analysis.
- Regression
- Pearson Correlation Test
- Test for Multicollinearity
- Descriptive Statistics
- Hypothesis Testing
Questions guiding our analysis incude:
- What is the expected relationship between financial inclusion and big data?
- Do big data sources have a significant impact on increased access to financial services?
- Does an economy with a structured data, accessible data source, and data infrastructure help streamline digital financial inclusion?
3.1. Data Location: All data used in this study were obtained from the World Bank DataBank. Data variables for Model 1 were obtained from the Global Findex Database whilst data variables for Model 2 were obtained from the Statistical Performance Indicator Database.
3.2. Data Organization: The extracted datasets from the Global Findex dataset comprised 1,450 rows whilst the SPI dataset extracted has 625 rows. Each dataset were organized by columns such as country names, series or variable description, and the corresponding years for each country. The downloaded data have already been cleaned and orginized in Excel before being imported for analysis with Python. The data used in this study therefore are organized by the coutnries and data variables in different columns.
3.3. Data Credibility: The data used are based on the survey of adults from the age of 15 and above accross all countries. The data were last updated in 2021. Only values for the year 2021 are considered in this analysis due to missing data for the previous year.
3.4. Data License and Privacy: The metadata from the datasets shows that its an open-source data which justifies the use for this study. The licence can be accessed via this link: https://datacatalog.worldbank.org/public-licenses#cc-by
Excel has been used to clean and restructure the dataset in a way that is more convenient for analysis. All variables needed for the study have been selected and placed in column sections. The datasets used in this study are named as follow:
-
model1: Contains financial inclusion and big data source variables.
-
model2: Contains financial inclusion and SPI variables.
#importing relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import pearsonr
#importing model1 dataframe and indexing country column to analyse on numerical data types
model1 = pd.read_csv("model1.csv")
model2 = pd.read_csv('model2.csv')
display(model1.head())
display(model2.head())
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
country | internet_access | mobile_utility_payment | digital_payment | mobile_phone_ownership | mobile_bill_payment | mobile_money_account | |
---|---|---|---|---|---|---|---|
0 | Algeria | 82.91 | 1.27 | 33.74 | 95.18 | 3.57 | 0.00 |
1 | Benin | 28.85 | 5.31 | 43.69 | 72.39 | 7.17 | 36.68 |
2 | Botswana | 50.40 | 15.81 | 51.78 | 88.43 | 20.28 | 36.57 |
3 | Burkina Faso | 23.76 | 6.23 | 33.29 | 76.13 | 9.83 | 24.67 |
4 | Cameroon | 40.48 | 8.60 | 49.85 | 75.23 | 11.59 | 42.43 |
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
country | source_data_capacity | spi_data_use | spi_data_services | spi_data_products | spi_data_sources | spi_data_infrastructure | account | |
---|---|---|---|---|---|---|---|---|
0 | Benin | 60.0 | 80.0 | 69.90 | 83.63 | 29.18 | 50.0 | 48.61 |
1 | Botswana | 60.0 | 60.0 | 67.53 | 77.80 | 61.53 | 40.0 | 58.76 |
2 | Burkina Faso | 60.0 | 80.0 | 69.03 | 81.46 | 36.66 | 60.0 | 36.11 |
3 | Cameroon | 20.0 | 60.0 | 63.87 | 82.09 | 23.34 | 35.0 | 51.65 |
4 | Chad | 50.0 | 70.0 | 57.03 | 75.84 | 18.02 | 30.0 | 23.65 |
#checking and confirming data types
model1.info()
print("\n")
model2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 40 non-null object
1 internet_access 40 non-null float64
2 mobile_utility_payment 40 non-null float64
3 digital_payment 40 non-null float64
4 mobile_phone_ownership 40 non-null float64
5 mobile_bill_payment 40 non-null float64
6 mobile_money_account 40 non-null float64
dtypes: float64(6), object(1)
memory usage: 2.3+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 35 non-null object
1 source_data_capacity 35 non-null float64
2 spi_data_use 35 non-null float64
3 spi_data_services 35 non-null float64
4 spi_data_products 35 non-null float64
5 spi_data_sources 35 non-null float64
6 spi_data_infrastructure 35 non-null float64
7 account 35 non-null float64
dtypes: float64(7), object(1)
memory usage: 2.3+ KB
We can confirm from the datasets review that all data variables are numeric whch is essential for statistical analysis. There is only one categorical variable in the dataset, 'country' which contains a list of all countries used in the study for each model. The country column is not used for statistical analysis as it is a non numerical variable. The variable could be used only when creating visual charts.
# Checking number of countries or obervations in each dataset
country_count_model_1 = len(model1['country'].unique())
country_count_model_2 = len(model2['country'].unique())
print("Number of countries for Model 1:", country_count_model_1)
print("Number of countries for Model 2:", country_count_model_2)
Number of countries for Model 1: 40
Number of countries for Model 2: 35
The dataset contain 40 countries and 35 countries for model 1 and 2 respectively. The countries are also referred to as the observations. Therefore, we have 40 observations for model 1 and 35 observations for model 2.
# Indicating and separating the dependent and independent variables from the model 1 dataframe
#Dependent variable definition
dep_var_model1 = model1['mobile_money_account']
#Independent variable definition
indep_var_model1 = model1[['internet_access', 'mobile_utility_payment',
'digital_payment', 'mobile_phone_ownership',
'mobile_bill_payment']]
# Adding a constant term to the independent variables for the intercept
indep_var_model1 = sm.add_constant(indep_var_model1)
# Fitting the OLS regression model
ols_model1 = sm.OLS(dep_var_model1, indep_var_model1).fit()
# Printing model summary
print(ols_model1.summary())
OLS Regression Results
================================================================================
Dep. Variable: mobile_money_account R-squared: 0.858
Model: OLS Adj. R-squared: 0.837
Method: Least Squares F-statistic: 41.18
Date: Tue, 23 Apr 2024 Prob (F-statistic): 1.77e-13
Time: 16:59:14 Log-Likelihood: -135.17
No. Observations: 40 AIC: 282.3
Df Residuals: 34 BIC: 292.5
Df Model: 5
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
const -7.4068 7.749 -0.956 0.346 -23.155 8.341
internet_access -0.5527 0.125 -4.411 0.000 -0.807 -0.298
mobile_utility_payment 0.7147 0.415 1.721 0.094 -0.129 1.559
digital_payment 0.6621 0.119 5.579 0.000 0.421 0.903
mobile_phone_ownership 0.2944 0.161 1.824 0.077 -0.034 0.622
mobile_bill_payment 0.0607 0.396 0.153 0.879 -0.744 0.865
==============================================================================
Omnibus: 3.825 Durbin-Watson: 2.053
Prob(Omnibus): 0.148 Jarque-Bera (JB): 2.562
Skew: -0.484 Prob(JB): 0.278
Kurtosis: 3.775 Cond. No. 618.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Creating a Visual Representation of the Regression Analsis for Model 1
# Getting the predicted values from the OLS model
predicted_values = ols_model1.predict()
# Creating a DataFrame with actual and predicted values
results_df = pd.DataFrame({'Actual': dep_var_model1, 'Predicted': predicted_values})
# Creating a scatter plot with a regression line using Seaborn
sns.lmplot(x='Actual', y='Predicted', data=results_df, line_kws={'color': 'red'})
# Setting plot labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Model 1: Actual vs. Predicted Values')
# Show plot
plt.show()
# Indicating and separating the dependent and independent variables from the model 2 dataframe
# Dependent variable definition for model 2
dep_var_model2 = model2['account']
# Independent variable definition for model 2
indep_var_model2 = model2[['source_data_capacity', 'spi_data_use',
'spi_data_services', 'spi_data_products',
'spi_data_sources', 'spi_data_infrastructure']]
# Adding a constant term to the independent variables for the intercept
indep_var_model2 = sm.add_constant(indep_var_model2)
# Fitting the OLS regression model for model 2
ols_model2 = sm.OLS(dep_var_model2, indep_var_model2).fit()
# Printing model summary for model 2
print(ols_model2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: account R-squared: 0.612
Model: OLS Adj. R-squared: 0.529
Method: Least Squares F-statistic: 7.353
Date: Tue, 23 Apr 2024 Prob (F-statistic): 8.64e-05
Time: 16:59:41 Log-Likelihood: -136.66
No. Observations: 35 AIC: 287.3
Df Residuals: 28 BIC: 298.2
Df Model: 6
Covariance Type: nonrobust
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
const 6.8971 25.628 0.269 0.790 -45.600 59.394
source_data_capacity -0.5816 0.214 -2.720 0.011 -1.020 -0.144
spi_data_use -0.2143 0.243 -0.882 0.385 -0.712 0.284
spi_data_services -0.7636 0.228 -3.343 0.002 -1.231 -0.296
spi_data_products 0.7052 0.485 1.455 0.157 -0.288 1.698
spi_data_sources 1.1690 0.273 4.276 0.000 0.609 1.729
spi_data_infrastructure 0.8336 0.289 2.880 0.008 0.241 1.427
==============================================================================
Omnibus: 2.268 Durbin-Watson: 2.512
Prob(Omnibus): 0.322 Jarque-Bera (JB): 1.155
Skew: -0.237 Prob(JB): 0.561
Kurtosis: 3.753 Cond. No. 1.72e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.72e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
# Creating a Visual Representation of the Regression Analsis for Model 2
# Getting the predicted values from the OLS model for model 2
predicted_values_model2 = ols_model2.predict()
# Creating a DataFrame with actual and predicted values for model 2
results_model2 = pd.DataFrame({'Actual': dep_var_model2, 'Predicted': predicted_values_model2})
# Creating a regression plot using Seaborn for model 2
sns.lmplot(x='Actual', y='Predicted', data=results_model2, scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})
# Setting plot labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Model 2: Actual vs. Predicted Values')
# Show plot
plt.show()
# creating a correlation matrix for model 1
correlation_matrix_model1 = model1.corr()
# Displaying the correlation matrix
print(correlation_matrix_model1)
# Plotting the heatmap of correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_model1, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Model 1: Correlation Matrix')
plt.show()
internet_access mobile_utility_payment \
internet_access 1.000000 0.275408
mobile_utility_payment 0.275408 1.000000
digital_payment 0.507508 0.736970
mobile_phone_ownership 0.856528 0.346597
mobile_bill_payment 0.389306 0.940515
mobile_money_account 0.092876 0.799250
digital_payment mobile_phone_ownership \
internet_access 0.507508 0.856528
mobile_utility_payment 0.736970 0.346597
digital_payment 1.000000 0.573997
mobile_phone_ownership 0.573997 1.000000
mobile_bill_payment 0.809292 0.440903
mobile_money_account 0.806038 0.279005
mobile_bill_payment mobile_money_account
internet_access 0.389306 0.092876
mobile_utility_payment 0.940515 0.799250
digital_payment 0.809292 0.806038
mobile_phone_ownership 0.440903 0.279005
mobile_bill_payment 1.000000 0.788281
mobile_money_account 0.788281 1.000000
# creating a correlation matrix for model 2
correlation_matrix_model2 = model2.corr()
# Displaying the correlation matrix
print(correlation_matrix_model1)
# Plotting the heatmap of correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_model2, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Model 2: Correlation Matrix')
plt.show()
internet_access mobile_utility_payment \
internet_access 1.000000 0.275408
mobile_utility_payment 0.275408 1.000000
digital_payment 0.507508 0.736970
mobile_phone_ownership 0.856528 0.346597
mobile_bill_payment 0.389306 0.940515
mobile_money_account 0.092876 0.799250
digital_payment mobile_phone_ownership \
internet_access 0.507508 0.856528
mobile_utility_payment 0.736970 0.346597
digital_payment 1.000000 0.573997
mobile_phone_ownership 0.573997 1.000000
mobile_bill_payment 0.809292 0.440903
mobile_money_account 0.806038 0.279005
mobile_bill_payment mobile_money_account
internet_access 0.389306 0.092876
mobile_utility_payment 0.940515 0.799250
digital_payment 0.809292 0.806038
mobile_phone_ownership 0.440903 0.279005
mobile_bill_payment 1.000000 0.788281
mobile_money_account 0.788281 1.000000
# Creating a DataFrame to store VIF results
vif_data_model1 = pd.DataFrame()
vif_data_model1["Variable"] = indep_var_model1.columns
vif_data_model1["VIF"] = [variance_inflation_factor(indep_var_model1.values, i) for i in range(indep_var_model1.shape[1])]
print("Variance Inflation Factor (VIF) for Model 1:")
print(vif_data_model1)
Variance Inflation Factor (VIF) for Model 1:
Variable VIF
0 const 40.483693
1 internet_access 3.902965
2 mobile_utility_payment 9.475282
3 digital_payment 3.495942
4 mobile_phone_ownership 4.171238
5 mobile_bill_payment 12.216094
# Calculating correlation matrix for independent variables in Model 1
correlation_matrix_model1 = model1[['internet_access', 'mobile_utility_payment',
'digital_payment', 'mobile_phone_ownership',
'mobile_bill_payment']].corr()
# Displaying correlation matrix for Model 1
print("Correlation Matrix for Model 1:")
print(correlation_matrix_model1)
# Plotting correlation matrix for Model 1
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_model1, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Model 1: Test for Multicollinearity Matrix')
plt.show()
Correlation Matrix for Model 1:
internet_access mobile_utility_payment \
internet_access 1.000000 0.275408
mobile_utility_payment 0.275408 1.000000
digital_payment 0.507508 0.736970
mobile_phone_ownership 0.856528 0.346597
mobile_bill_payment 0.389306 0.940515
digital_payment mobile_phone_ownership \
internet_access 0.507508 0.856528
mobile_utility_payment 0.736970 0.346597
digital_payment 1.000000 0.573997
mobile_phone_ownership 0.573997 1.000000
mobile_bill_payment 0.809292 0.440903
mobile_bill_payment
internet_access 0.389306
mobile_utility_payment 0.940515
digital_payment 0.809292
mobile_phone_ownership 0.440903
mobile_bill_payment 1.000000
# Creating a DataFrame to store VIF results
vif_data_model2 = pd.DataFrame()
vif_data_model2["Variable"] = indep_var_model2.columns
vif_data_model2["VIF"] = [variance_inflation_factor(indep_var_model2.values, i) for i in range(indep_var_model2.shape[1])]
print("Variance Inflation Factor (VIF) for Model 2:")
print(vif_data_model2)
Variance Inflation Factor (VIF) for Model 2:
Variable VIF
0 const 127.526234
1 source_data_capacity 2.453183
2 spi_data_use 2.039501
3 spi_data_services 2.109660
4 spi_data_products 2.651508
5 spi_data_sources 2.616190
6 spi_data_infrastructure 2.604678
# Calculatting correlation matrix for independent variables in Model 2
correlation_matrix_model2 = model2[['source_data_capacity', 'spi_data_use',
'spi_data_services', 'spi_data_products',
'spi_data_sources', 'spi_data_infrastructure']].corr()
# Displaying correlation matrix for Model 2
print("Correlation Matrix for Model 2:")
print(correlation_matrix_model2)
# Plotting correlation matrix for Model 2
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_model2, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Model 2: Test for Multicollinearity Matrix')
plt.show()
Correlation Matrix for Model 2:
source_data_capacity spi_data_use \
source_data_capacity 1.000000 0.368934
spi_data_use 0.368934 1.000000
spi_data_services 0.199737 0.243448
spi_data_products 0.482582 0.585447
spi_data_sources 0.605898 0.268184
spi_data_infrastructure 0.654020 0.539743
spi_data_services spi_data_products \
source_data_capacity 0.199737 0.482582
spi_data_use 0.243448 0.585447
spi_data_services 1.000000 0.569445
spi_data_products 0.569445 1.000000
spi_data_sources 0.582020 0.546027
spi_data_infrastructure 0.378305 0.439675
spi_data_sources spi_data_infrastructure
source_data_capacity 0.605898 0.654020
spi_data_use 0.268184 0.539743
spi_data_services 0.582020 0.378305
spi_data_products 0.546027 0.439675
spi_data_sources 1.000000 0.592323
spi_data_infrastructure 0.592323 1.000000
# Calculating descriptive statistics for model 1
desc_model1 = model1.describe()
desc_model1
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
internet_access | mobile_utility_payment | digital_payment | mobile_phone_ownership | mobile_bill_payment | mobile_money_account | |
---|---|---|---|---|---|---|
count | 40.000000 | 40.000000 | 40.000000 | 40.000000 | 40.000000 | 40.000000 |
mean | 37.359000 | 9.635500 | 41.865250 | 72.711750 | 13.412500 | 28.772500 |
std | 19.445526 | 9.140128 | 19.434911 | 15.605675 | 10.886769 | 19.105076 |
min | 5.220000 | 0.000000 | 4.810000 | 32.160000 | 1.450000 | 0.000000 |
25% | 23.115000 | 3.142500 | 26.740000 | 63.490000 | 4.937500 | 9.782500 |
50% | 31.055000 | 6.225000 | 42.985000 | 75.665000 | 9.855000 | 31.865000 |
75% | 49.785000 | 13.865000 | 54.182500 | 83.212500 | 18.960000 | 42.477500 |
max | 82.910000 | 36.060000 | 80.810000 | 100.000000 | 44.910000 | 68.660000 |
# Selecting only numeric columns for calculating descriptive statistics
numeric_cols_model1 = ['mobile_money_account','internet_access', 'mobile_utility_payment',
'digital_payment', 'mobile_phone_ownership',
'mobile_bill_payment']
desc_model1 = model1[numeric_cols_model1].describe()
# Adding skewness and kurtosis
desc_model1.loc['skewness'] = model1[numeric_cols_model1].skew()
desc_model1.loc['kurtosis'] = model1[numeric_cols_model1].kurt()
# Adding additional metrics
desc_model1.loc['Standard Error'] = model1[numeric_cols_model1].sem()
desc_model1.loc['Mode'] = model1[numeric_cols_model1].mode().iloc[0] # Get the mode from the first row
desc_model1.loc['Sample Variance'] = model1[numeric_cols_model1].var()
desc_model1.loc['Range'] = model1[numeric_cols_model1].max() - model1[numeric_cols_model1].min()
desc_model1.loc['Sum'] = model1[numeric_cols_model1].sum()
# Displaying the summary
desc_model1
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
mobile_money_account | internet_access | mobile_utility_payment | digital_payment | mobile_phone_ownership | mobile_bill_payment | |
---|---|---|---|---|---|---|
count | 40.000000 | 40.000000 | 40.000000 | 40.000000 | 40.000000 | 40.000000 |
mean | 28.772500 | 37.359000 | 9.635500 | 41.865250 | 72.711750 | 13.412500 |
std | 19.105076 | 19.445526 | 9.140128 | 19.434911 | 15.605675 | 10.886769 |
min | 0.000000 | 5.220000 | 0.000000 | 4.810000 | 32.160000 | 1.450000 |
25% | 9.782500 | 23.115000 | 3.142500 | 26.740000 | 63.490000 | 4.937500 |
50% | 31.865000 | 31.055000 | 6.225000 | 42.985000 | 75.665000 | 9.855000 |
75% | 42.477500 | 49.785000 | 13.865000 | 54.182500 | 83.212500 | 18.960000 |
max | 68.660000 | 82.910000 | 36.060000 | 80.810000 | 100.000000 | 44.910000 |
skewness | 0.092837 | 0.692206 | 1.338398 | 0.238878 | -0.677023 | 1.395810 |
kurtosis | -1.058531 | -0.184415 | 1.173069 | -0.652349 | 0.111974 | 1.681350 |
Standard Error | 3.020778 | 3.074608 | 1.445181 | 3.072929 | 2.467474 | 1.721349 |
Mode | 29.380000 | 5.220000 | 13.610000 | 33.740000 | 82.760000 | 1.450000 |
Sample Variance | 365.003927 | 378.128496 | 83.541943 | 377.715749 | 243.537097 | 118.521737 |
Range | 68.660000 | 77.690000 | 36.060000 | 76.000000 | 67.840000 | 43.460000 |
Sum | 1150.900000 | 1494.360000 | 385.420000 | 1674.610000 | 2908.470000 | 536.500000 |
# Calculating descriptive statistics for model 2
desc_model2 = model2.describe()
# Displaying the summary
desc_model2
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
source_data_capacity | spi_data_use | spi_data_services | spi_data_products | spi_data_sources | spi_data_infrastructure | account | |
---|---|---|---|---|---|---|---|
count | 35.000000 | 35.000000 | 35.000000 | 35.000000 | 35.000000 | 35.000000 | 35.000000 |
mean | 55.142857 | 76.851429 | 61.064857 | 78.389429 | 32.927143 | 51.857143 | 48.731429 |
std | 16.868249 | 13.529605 | 14.641475 | 7.736023 | 13.621253 | 12.838689 | 19.553514 |
min | 20.000000 | 40.000000 | 20.600000 | 53.920000 | 9.040000 | 25.000000 | 5.830000 |
25% | 45.000000 | 70.000000 | 60.300000 | 76.540000 | 25.590000 | 42.500000 | 34.560000 |
50% | 60.000000 | 80.000000 | 63.870000 | 78.960000 | 29.180000 | 50.000000 | 49.490000 |
75% | 65.000000 | 90.000000 | 67.200000 | 82.445000 | 38.620000 | 60.000000 | 61.690000 |
max | 80.000000 | 100.000000 | 86.670000 | 89.390000 | 70.280000 | 80.000000 | 90.530000 |
# Selecting only numeric columns for calculating descriptive statistics
numeric_cols_model2 = ['account','source_data_capacity', 'spi_data_use',
'spi_data_services', 'spi_data_products',
'spi_data_sources', 'spi_data_infrastructure']
desc_model2 = model2[numeric_cols_model2].describe()
# Adding skewness and kurtosis
desc_model2.loc['skewness'] = model2[numeric_cols_model2].skew()
desc_model2.loc['kurtosis'] = model2[numeric_cols_model2].kurt()
# Adding additional metrics
desc_model2.loc['Standard Error'] = model2[numeric_cols_model2].sem()
desc_model2.loc['Mode'] = model2[numeric_cols_model2].mode().iloc[0] # Get the mode from the first row
desc_model2.loc['Sample Variance'] = model2[numeric_cols_model2].var()
desc_model2.loc['Range'] = model2[numeric_cols_model2].max() - model2[numeric_cols_model2].min()
desc_model2.loc['Sum'] = model2[numeric_cols_model2].sum()
# Displaying the summary
desc_model2
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
account | source_data_capacity | spi_data_use | spi_data_services | spi_data_products | spi_data_sources | spi_data_infrastructure | |
---|---|---|---|---|---|---|---|
count | 35.000000 | 35.000000 | 35.000000 | 35.000000 | 35.000000 | 35.000000 | 35.000000 |
mean | 48.731429 | 55.142857 | 76.851429 | 61.064857 | 78.389429 | 32.927143 | 51.857143 |
std | 19.553514 | 16.868249 | 13.529605 | 14.641475 | 7.736023 | 13.621253 | 12.838689 |
min | 5.830000 | 20.000000 | 40.000000 | 20.600000 | 53.920000 | 9.040000 | 25.000000 |
25% | 34.560000 | 45.000000 | 70.000000 | 60.300000 | 76.540000 | 25.590000 | 42.500000 |
50% | 49.490000 | 60.000000 | 80.000000 | 63.870000 | 78.960000 | 29.180000 | 50.000000 |
75% | 61.690000 | 65.000000 | 90.000000 | 67.200000 | 82.445000 | 38.620000 | 60.000000 |
max | 90.530000 | 80.000000 | 100.000000 | 86.670000 | 89.390000 | 70.280000 | 80.000000 |
skewness | -0.048651 | -0.377096 | -0.616829 | -1.168967 | -1.191027 | 0.971202 | 0.102107 |
kurtosis | -0.076453 | -0.434600 | 0.417437 | 1.996276 | 1.960332 | 1.244482 | -0.361374 |
Standard Error | 3.305147 | 2.851254 | 2.286921 | 2.474861 | 1.307627 | 2.302412 | 2.170135 |
Mode | 5.830000 | 60.000000 | 80.000000 | 57.030000 | 53.920000 | 9.040000 | 40.000000 |
Sample Variance | 382.339924 | 284.537815 | 183.050218 | 214.372796 | 59.846053 | 185.538533 | 164.831933 |
Range | 84.700000 | 60.000000 | 60.000000 | 66.070000 | 35.470000 | 61.240000 | 55.000000 |
Sum | 1705.600000 | 1930.000000 | 2689.800000 | 2137.270000 | 2743.630000 | 1152.450000 | 1815.000000 |
# Skewness data for Model 1
skewness_model1 = [0.09, 0.69, 1.34, 0.24, -0.68, 1.40]
# Variable names for Model 2
variables_model1 = ['MMA', 'IA', 'MUP', 'DP', 'MPO', 'MBP']
# Plotting skewness for Model 2
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=skewness_model1, y=variables_model1, palette="viridis")
plt.title('Skewness for Model 1 Variables')
plt.xlabel('Skewness')
plt.ylabel('Variables')
# Adding data points to the bars
for i in range(len(skewness_model1)):
ax.text(skewness_model1[i], i, f'{skewness_model1[i]:.2f}', ha='left', va='center', color='black')
plt.show()
# Skewness data for Model 2
skewness_model2 = [-0.05, -0.38, -0.62, -1.17, -1.19, 0.97, 0.10]
# Variable names for Model 2
variables_model2 = ['ACC', 'SDC', 'DUS', 'DSS', 'DPS', 'DSC', 'DIS']
# Plotting skewness for Model 2
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=skewness_model2, y=variables_model2, palette="viridis")
plt.title('Skewness for Model 2 Variables')
plt.xlabel('Skewness')
plt.ylabel('Variables')
# Adding data points to the bars
for i in range(len(skewness_model2)):
ax.text(skewness_model2[i], i, f'{skewness_model2[i]:.2f}', ha='left', va='center', color='black')
plt.show()
# Line of Best fit for Model 1
ols_model1
# Overall Model Significance Test (F-test) for Model 1
f_statistic, f_p_value = ols_model1.fvalue, ols_model1.f_pvalue
print("Model 1 F-statistic:", f_statistic)
print("Model 1 p-value (F-test):", f_p_value)
# Individual Coefficient Significance Test (t-test) for Model 1
t_test_results = ols_model1.t_test(np.eye(len(ols_model1.params)))
print("Individual Coefficient p-values (t-test) for Model 1:")
for idx, variable in enumerate(indep_var_model1.columns):
print(f"{variable}: {t_test_results.pvalue[idx]}")
Model 1 F-statistic: 41.184773278043544
Model 1 p-value (F-test): 1.7719819035065024e-13
Individual Coefficient p-values (t-test) for Model 1:
const: 0.3459100134441736
internet_access: 9.831638939501509e-05
mobile_utility_payment: 0.09440462482524704
digital_payment: 3.032239454019748e-06
mobile_phone_ownership: 0.07693681973223013
mobile_bill_payment: 0.8790651297196372
# Line of best fit for Model 2
ols_model2
# Overall Model Significance Test (F-test) for Model 2
f_statistic, f_p_value = ols_model2.fvalue, ols_model2.f_pvalue
print("Model 2 F-statistic:", f_statistic)
print("Model 2 p-value (F-test):", f_p_value)
# Individual Coefficient Significance Test (t-test) for Model 2
t_test_results = ols_model2.t_test(np.eye(len(ols_model2.params)))
print("Individual Coefficient p-values (t-test) for Model 2:")
for idx, variable in enumerate(indep_var_model2.columns):
print(f"{variable}: {t_test_results.pvalue[idx]}")
Model 2 F-statistic: 7.352573148811251
Model 2 p-value (F-test): 8.644605466870827e-05
Individual Coefficient p-values (t-test) for Model 2:
const: 0.7898080501069904
source_data_capacity: 0.011083716383894056
spi_data_use: 0.38538856115979137
spi_data_services: 0.0023663090448723797
spi_data_products: 0.15678092693555912
spi_data_sources: 0.0001999381497366182
spi_data_infrastructure: 0.0075410358992607385
# Exporting regression result as an Excel File
# Convert the regression results summary to a dataframe
regression_results_model1 = ols_model1.summary()
regression_results_df_model1 = pd.read_html(regression_results_model1.tables[1].as_html(), header=0, index_col=0)[0]
# Export the dataframe to an Excel file
regression_results_df_model1.to_excel("regression_results_model1.xlsx")
print("Regression results for Model 1.xlsx")
Regression results for Model 1.xlsx
# Exporting descriptive stats result as an excel file for Model 1
desc_model1.to_excel('desc_model1.xlsx')
print('Descriptive statistics for Model 1.xlsx')
Descriptive statistics for Model 1.xlsx
# Exporting descriptive stats result as an excel file for Model 2
desc_model2.to_excel('desc_model2.xlsx')
print('Descriptive statistics for Model 2.xlsx')
Descriptive statistics for Model 2.xlsx
predicted_values_model2
array([36.25275297, 67.71857471, 52.46712807, 47.98902335, 18.8266856 ,
28.58058205, 44.9136686 , 53.01701255, 67.08734359, 50.64486208,
56.00199781, 44.26275099, 62.90811648, 39.46190352, 59.06135446,
61.11429563, 45.06825223, 38.74550916, 53.30689915, 36.97488527,
33.13351753, 70.41003994, 44.86373269, 43.75912241, 46.66711801,
40.7967662 , 62.99003982, 34.50329914, 99.3950081 , 19.99720639,
41.06566196, 50.30056932, 51.74385254, 41.88368771, 59.68677996])