# Computing correlation coefficients and p-values
We will test whether certain features behave as dependent variables of the flow of traffic.
We want to assess the strength and direction of the linear relationship between a few continuous variables. The goal is to explore associations between variables rather than to predict outcomes.

#### Methodologies used:
- Pearson coefficient
- Spearman coefficient

"Pearson's correlation coefficient assesses a linear relationship, and is closely related to simple linear regression. Spearman's correlation coefficient works on ranks and therefore does not assess a linear relationship".

https://stats.stackexchange.com/questions/625858/
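
The distinction in the quote can be illustrated with a minimal pure-Python sketch. The helper names `pearson`, `ranks`, and `spearman` and the toy data are assumptions made for illustration and are not part of this notebook. Tied values are not handled here, whereas `scipy.stats.spearmanr` does handle them.

```python
import math

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho is Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data: y grows monotonically with x, but not along a straight line.
x = list(range(1, 11))
y = [v ** 3 for v in x]

print("Pearson: ", round(pearson(x, y), 3))   # below 1: the relationship is not linear
print("Spearman:", round(spearman(x, y), 3))  # 1.0: the ordering is preserved exactly
```

Because the ordering of `y` follows the ordering of `x` exactly, the rank-based coefficient is 1 even though the linear coefficient is not.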

In [49]:
import pandas as pd

In [50]:
df = pd.read_csv('../../data/processed/proposal_exploration/regression_intersection_neighborhood_features.csv')
df

Unnamed: 0,location_id,neighbourhood_id,total_traffic_volume,total_median_income,labor_force_amount
0,3969,1,13663,33600,18405
1,3970,1,75871,33600,18405
2,3971,1,28496,33600,18405
3,3972,1,27930,33600,18405
4,3973,1,21627,33600,18405
...,...,...,...,...,...
2634,41926,174,1904,52400,15070
2635,42259,174,10081,52400,15070
2636,42525,174,1364,52400,15070
2637,42526,174,8809,52400,15070


In [69]:
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr, spearmanr

### Test #1 - correlation using Pearson and Spearman correlation coefficients.
The independent variable will be total_traffic_volume.
The dependent variables will be total_median_income and labor_force_amount.

In [71]:
# Calculate Pearson correlation coefficients and p-values
corr_income, pval_income = pearsonr(df['total_traffic_volume'], df['total_median_income'])
corr_labor, pval_labor = pearsonr(df['total_traffic_volume'], df['labor_force_amount'])

print("Pearson correlation coefficient between total_traffic_volume and total_median_income:", corr_income)
print("P-value for total_median_income:", pval_income)

print("Pearson correlation coefficient between total_traffic_volume and labor_force_amount:", corr_labor)
print("P-value for labor_force_amount:", pval_labor)

print('\n')
# Calculate Spearman correlation coefficients and p-values
corr_income, pval_income = spearmanr(df['total_traffic_volume'], df['total_median_income'])
corr_labor, pval_labor = spearmanr(df['total_traffic_volume'], df['labor_force_amount'])

print("Spearman correlation coefficient between total_traffic_volume and total_median_income:", corr_income)
print("P-value for total_median_income:", pval_income)

print("Spearman correlation coefficient between total_traffic_volume and labor_force_amount:", corr_labor)
print("P-value for labor_force_amount:", pval_labor)

Pearson correlation coefficient between total_traffic_volume and total_median_income: 0.04438154509056399
P-value for total_median_income: 0.0226090882156005
Pearson correlation coefficient between total_traffic_volume and labor_force_amount: 0.04303210323430981
P-value for labor_force_amount: 0.02706442604714477


Spearman correlation coefficient between total_traffic_volume and total_median_income: -0.008442706611860426
P-value for total_median_income: 0.6646412513358554
Spearman correlation coefficient between total_traffic_volume and labor_force_amount: 0.09460446777451627
P-value for labor_force_amount: 1.1241171072950731e-06


#### Results #1
The Pearson correlation coefficient between total_traffic_volume and total_median_income is 0.044.

This indicates practically no linear relationship, likely because the intersections need to be grouped by neighborhood. Income is measured on a per-neighborhood basis, so traffic flow should also be measured on a per-neighborhood basis.

Therefore, the traffic volume will be grouped by neighborhood ID and summed.
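
A very small coefficient can still come with a p-value below 0.05 when the sample is large, as in the Test #1 output, so the coefficient itself is the more informative number here. The sketch below uses the standard t-test for a Pearson coefficient, `t = r * sqrt((n - 2) / (1 - r^2))`. The function name `pearson_p_value` and the sample sizes are hypothetical and chosen only for illustration.

```python
import math
from scipy.stats import t as student_t

def pearson_p_value(r, n):
    """Two-sided p-value for a Pearson coefficient r observed on n paired samples."""
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    # sf is the survival function (1 - CDF); double it for a two-sided test.
    return 2 * student_t.sf(abs(t_stat), df=n - 2)

# The same weak coefficient at increasing (hypothetical) sample sizes.
for n in (100, 1000, 10000):
    print(n, pearson_p_value(0.05, n))
```

The p-value shrinks as `n` grows even though `r` stays at 0.05, so statistical significance alone does not imply a meaningful relationship.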

### Test #2 - correlation using Pearson and Spearman correlation coefficients.
Traffic flow will be summed on a per-neighborhood basis.


In [62]:
# Summing the volume of all intersections based on neighborhood ID
test2_df_grouped = df.groupby('neighbourhood_id')['total_traffic_volume'].sum().reset_index()

# Prepare a df to merge into
df2merge = df.drop(['total_traffic_volume','location_id'],axis=1)
#df2merge

# Merge so each row carries the neighborhood's summed traffic volume alongside its income and labor force values
test2_df = pd.merge(test2_df_grouped, df2merge, how="inner", on='neighbourhood_id')

# Remove duplicated rows after merging
test2_df = test2_df.drop_duplicates()

test2_df

Unnamed: 0,neighbourhood_id,total_traffic_volume,total_median_income,labor_force_amount
0,1,853529,33600,18405
53,2,212521,29600,14360
67,3,89635,32800,4990
73,4,101721,33600,5305
82,5,64063,34400,4425
...,...,...,...,...
2537,170,1525153,44000,8620
2554,171,571144,41200,14040
2583,172,557818,38000,7460
2620,173,75249,46000,10870
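
The merge and `drop_duplicates` steps above can likely be collapsed into a single aggregation. The sketch below runs on a small made-up frame (the values are hypothetical, not taken from the CSV) and assumes that every row of a neighborhood carries the same `total_median_income` and `labor_force_amount`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "location_id": [1, 2, 3, 4, 5],
    "neighbourhood_id": [1, 1, 2, 2, 2],
    "total_traffic_volume": [100, 300, 50, 70, 80],
    "total_median_income": [33600, 33600, 29600, 29600, 29600],
    "labor_force_amount": [18405, 18405, 14360, 14360, 14360],
})

# One row per neighborhood: sum the traffic, keep the first (constant) value of the rest.
per_neighbourhood = (
    toy.groupby("neighbourhood_id", as_index=False)
       .agg(
           total_traffic_volume=("total_traffic_volume", "sum"),
           total_median_income=("total_median_income", "first"),
           labor_force_amount=("labor_force_amount", "first"),
       )
)
print(per_neighbourhood)
```

Swapping `"sum"` for `"median"` or `"mean"` would give the aggregation variants used in the later tests.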


In [72]:
# Calculate Pearson correlation coefficients and p-values
corr_income, pval_income = pearsonr(test2_df['total_traffic_volume'], test2_df['total_median_income'])
corr_labor, pval_labor = pearsonr(test2_df['total_traffic_volume'], test2_df['labor_force_amount'])

print("Pearson correlation coefficient between total_traffic_volume and total_median_income:", corr_income)
print("P-value for total_median_income:", pval_income)

print("Pearson correlation coefficient between total_traffic_volume and labor_force_amount:", corr_labor)
print("P-value for labor_force_amount:", pval_labor)

print('\n')
# Calculate Spearman correlation coefficients and p-values
corr_income, pval_income = spearmanr(test2_df['total_traffic_volume'], test2_df['total_median_income'])
corr_labor, pval_labor = spearmanr(test2_df['total_traffic_volume'], test2_df['labor_force_amount'])

print("Spearman correlation coefficient between total_traffic_volume and total_median_income:", corr_income)
print("P-value for total_median_income:", pval_income)

print("Spearman correlation coefficient between total_traffic_volume and labor_force_amount:", corr_labor)
print("P-value for labor_force_amount:", pval_labor)

Pearson correlation coefficient between total_traffic_volume and total_median_income: 0.24783702665734317
P-value for total_median_income: 0.0016914499414639246
Pearson correlation coefficient between total_traffic_volume and labor_force_amount: 0.4388234527274796
P-value for labor_force_amount: 8.051666477901505e-09


Spearman correlation coefficient between total_traffic_volume and total_median_income: 0.14547115113501524
P-value for total_median_income: 0.06819160761214493
Spearman correlation coefficient between total_traffic_volume and labor_force_amount: 0.441823849143379
P-value for labor_force_amount: 6.195300715766874e-09


#### Results #2
There is a low correlation between traffic volume and median income (Pearson 0.25, Spearman 0.15), and the Spearman p-value (0.068) is above 0.05.
There is a moderate correlation between the summed traffic volume of all intersections within a neighborhood and the number of people in the labor force in that neighborhood (Pearson 0.44, Spearman 0.44). The p-values for the labor force amount are far below 0.05, so the result is statistically significant and supports rejecting the null hypothesis of no relationship.
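
To put "low" and "moderate" in perspective, squaring a Pearson coefficient gives the share of variance that a straight-line fit would account for. This is a minimal sketch using the two Pearson coefficients printed in the Test #2 output; the variable names are illustrative only.

```python
# Pearson coefficients copied from the Test #2 output above.
pearson_r = {
    "total_median_income": 0.24783702665734317,
    "labor_force_amount": 0.4388234527274796,
}

# r squared is the share of variance a straight-line fit would account for.
r_squared = {name: r ** 2 for name, r in pearson_r.items()}

for name, value in r_squared.items():
    print(f"{name}: r^2 = {value:.3f}")
```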

### Test #3
Traffic flow will not be summed on a per-neighborhood basis. Instead, we will take the median traffic flow of the intersections in each neighborhood.

In [64]:
# getting the median volume of all intersections based on neighborhood ID
test3_df_grouped = df.groupby('neighbourhood_id')['total_traffic_volume'].median().reset_index()

# Prepare a df to merge into
df3merge = df.drop(['total_traffic_volume','location_id'],axis=1)
#df3merge

# Merge so each row carries the neighborhood's median traffic volume alongside its income and labor force values
test3_df = pd.merge(test3_df_grouped, df3merge, how="inner", on='neighbourhood_id')

# Remove duplicated rows after merging
test3_df = test3_df.drop_duplicates()

test3_df

Unnamed: 0,neighbourhood_id,total_traffic_volume,total_median_income,labor_force_amount
0,1,9884.0,33600,18405
53,2,8359.0,29600,14360
67,3,11429.5,32800,4990
73,4,10510.0,33600,5305
82,5,5146.5,34400,4425
...,...,...,...,...
2537,170,40533.0,44000,8620
2554,171,9243.0,41200,14040
2583,172,7534.0,38000,7460
2620,173,15994.0,46000,10870


In [65]:
# Calculate Pearson correlation coefficients and p-values
corr_income, pval_income = pearsonr(test3_df['total_traffic_volume'], test3_df['total_median_income'])
corr_labor, pval_labor = pearsonr(test3_df['total_traffic_volume'], test3_df['labor_force_amount'])

print("Pearson correlation coefficient between total_traffic_volume and total_median_income:", corr_income)
print("P-value for total_median_income:", pval_income)

print("Pearson correlation coefficient between total_traffic_volume and labor_force_amount:", corr_labor)
print("P-value for labor_force_amount:", pval_labor)

Pearson correlation coefficient between total_traffic_volume and total_median_income: -0.07687571901562024
P-value for total_median_income: 0.3370242157512205
Pearson correlation coefficient between total_traffic_volume and labor_force_amount: 0.1738734826302417
P-value for labor_force_amount: 0.02889824445161856


#### Results #3
Using the median traffic volume of all intersections in each neighborhood yielded considerably weaker correlations than using the sum.
For median traffic volume and median income, there was a very low negative correlation (-0.08).
For median traffic volume and the number of people in the labor force, there was also a very low correlation (0.17), although its p-value (0.029) is below 0.05.

The p-value for total_median_income (0.34) was not statistically significant, so we cannot reject the null hypothesis if we pursue this route.

### Test #4
Traffic flow will be averaged from all intersections on a per-neighborhood basis. 

In [66]:
# getting the mean volume of all intersections based on neighborhood ID
test4_df_grouped = df.groupby('neighbourhood_id')['total_traffic_volume'].mean().reset_index()

# Prepare a df to merge into
df4merge = df.drop(['total_traffic_volume','location_id'],axis=1)
#df4merge

# Merge so each row carries the neighborhood's mean traffic volume alongside its income and labor force values
test4_df = pd.merge(test4_df_grouped, df4merge, how="inner", on='neighbourhood_id')

# Remove duplicated rows after merging
test4_df = test4_df.drop_duplicates()

test4_df

Unnamed: 0,neighbourhood_id,total_traffic_volume,total_median_income,labor_force_amount
0,1,16104.320755,33600,18405
53,2,15180.071429,29600,14360
67,3,14939.166667,32800,4990
73,4,11302.333333,33600,5305
82,5,10677.166667,34400,4425
...,...,...,...,...
2537,170,89714.882353,44000,8620
2554,171,19694.620690,41200,14040
2583,172,15076.162162,38000,7460
2620,173,18812.250000,46000,10870


In [68]:
# Calculate Pearson correlation coefficients and p-values
corr_income, pval_income = pearsonr(test4_df['total_traffic_volume'], test4_df['total_median_income'])
corr_labor, pval_labor = pearsonr(test4_df['total_traffic_volume'], test4_df['labor_force_amount'])

print("Pearson correlation coefficient between total_traffic_volume and total_median_income:", corr_income)
print("P-value for total_median_income:", pval_income)

print("Pearson correlation coefficient between total_traffic_volume and labor_force_amount:", corr_labor)
print("P-value for labor_force_amount:", pval_labor)

Pearson correlation coefficient between total_traffic_volume and total_median_income: 0.0095200146484541
P-value for total_median_income: 0.9054993797141337
Pearson correlation coefficient between total_traffic_volume and labor_force_amount: 0.13275000935339482
P-value for labor_force_amount: 0.09636057012554836


#### Results #4
There is practically no relationship between the average traffic flow from all intersections in a neighborhood and the neighborhood's median income.
There is a very small correlation between the average traffic flow from all intersections in a neighborhood and the number of people in the labor force.

Furthermore, the p-values for both features are above 0.05, so neither relationship is statistically significant.
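
Tests #2 to #4 differ only in the aggregation function, so they could be run in one loop. The sketch below is an assumption about how that might look: the helper name `correlate_by_aggregation` is hypothetical, and the toy frame is randomly generated rather than loaded from the CSV.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def correlate_by_aggregation(frame, how):
    """Aggregate traffic per neighborhood with `how`, then correlate with labor force."""
    grouped = frame.groupby("neighbourhood_id").agg(
        total_traffic_volume=("total_traffic_volume", how),
        labor_force_amount=("labor_force_amount", "first"),
    )
    return pearsonr(grouped["total_traffic_volume"], grouped["labor_force_amount"])

# Hypothetical data: 30 neighborhoods with 10 intersections each.
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(30), 10)
labor_by_neighbourhood = rng.integers(4_000, 20_000, size=30)
toy = pd.DataFrame({
    "neighbourhood_id": ids,
    "total_traffic_volume": rng.integers(1_000, 50_000, size=300),
    # Labor force is constant within a neighborhood, as in the notebook's data.
    "labor_force_amount": labor_by_neighbourhood[ids],
})

for how in ("sum", "median", "mean"):
    r, p = correlate_by_aggregation(toy, how)
    print(f"{how:>6}: r = {r:.3f}, p = {p:.3f}")
```

With equal group sizes, as in this toy frame, the sum and the mean give the same coefficient. They differ in the notebook because neighborhoods contain different numbers of intersections, so the sum also reflects how many intersections a neighborhood has.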

## Discussion

**Hypothesis**: Within each neighborhood, there is a relationship between the total traffic volume across all intersections and the number of people in the labor force.

**Null-hypothesis**: Within each neighborhood, there is no relationship between the total traffic volume across all intersections and the number of people in the labor force.

### Excerpt of Test #2 Results:
- For the correlation between total traffic volume and amount of people in the labor force:
  - Spearman correlation coefficient: 0.4418
  - P-value: 6.20e-09

### Conclusion
At a significance level of 0.05, the results of Test #2 **support rejecting the null hypothesis**, suggesting a statistically significant relationship between the total traffic volume across all intersections and the number of people in the labor force within each neighborhood. As a result, we will investigate this relationship further to better understand its strength and its implications for local economic expansion.
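
Because the notebook reports twelve p-values across Tests #1 to #4, one possible robustness check is a Bonferroni correction, which multiplies each p-value by the number of tests. This check is not part of the original analysis. The sketch below applies it to the Test #2 p-values copied from the output, and treating all twelve reported p-values as a single family is an assumption.

```python
# Number of p-values reported across Tests #1 to #4 (4 + 4 + 2 + 2).
NUM_TESTS = 12
ALPHA = 0.05

# p-values copied from the Test #2 output.
test2_p_values = {
    "pearson / total_median_income": 0.0016914499414639246,
    "pearson / labor_force_amount": 8.051666477901505e-09,
    "spearman / total_median_income": 0.06819160761214493,
    "spearman / labor_force_amount": 6.195300715766874e-09,
}

def bonferroni(p_value, num_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * num_tests)

for name, p in test2_p_values.items():
    adjusted = bonferroni(p, NUM_TESTS)
    verdict = "significant" if adjusted < ALPHA else "not significant"
    print(f"{name}: adjusted p = {adjusted:.3g} ({verdict})")
```

Under this assumption, the labor force relationship from Test #2 stays below the 0.05 threshold after correction.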
