# Network Graph

In [2]:
# imports
import os
import pandas as pd
from sklearn.preprocessing import StandardScaler
from itertools import product


This notebook builds the network graph computation pipeline: data preparation, similarity computation, and visualization steps that together produce a county similarity network graph for user analysis.

Data Preprocessing and Merging:
Merged the inflow and outflow migration datasets by grouping on state and county.
Cleaned the house price dataset and merged it with the migration data.
Processed the health dataset by dropping non-numeric columns that would need further processing, keeping only columns already in a usable format, then merged the result with the migration and house price data to create a unified dataset.

Feature Scaling:
Scaled all numeric attributes in the dataset so that every feature contributes on a comparable scale to the similarity computation.

Pairwise County Data Creation:
Created all possible county pairs for the dataset (pairwise data). With 3,073 counties, this produces 3,073 × 3,072 = 9,440,256 ordered pairs, since each pair appears in both directions.

Similarity Matrix Calculation on AWS:
Because of the data size, the similarity matrix will be computed on AWS. The plan is to use an EMR cluster and an S3 bucket for the computation.
The output will be one similarity measure per county pair, which can be visualized in Tableau: users select a county and see how every other county relates to it. Similarity is encoded as color intensity, with darker colors indicating higher similarity and lighter colors indicating lower similarity.
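
The notebook does not say which similarity measure the AWS job will compute. As a minimal sketch, assuming cosine similarity over the scaled feature vectors (the metric, the county ids, and the feature values below are all assumptions for illustration), the per-pair output could be produced like this:

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical scaled features for three counties (made-up values).
scaled_features = {
    "AK20": [1.0, 0.5, -0.2],
    "AK90": [0.9, 0.6, -0.1],
    "TX1": [-1.0, -0.4, 0.3],
}

# Cosine similarity is symmetric, so each unordered pair is computed once.
similarities = {
    (a, b): cosine_similarity(scaled_features[a], scaled_features[b])
    for a, b in combinations(scaled_features, 2)
}

for (a, b), score in similarities.items():
    print(f"{a} - {b}: {score:.3f}")
```

Because standardized features can be negative, cosine similarity here ranges from -1 to 1, which matters when mapping scores to a color scale. On the full dataset a vectorized or distributed implementation would be needed; this plain-Python version only illustrates the calculation.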

Network Graph Visualization:
The network visualization aims to provide insight into county clusters with similar attributes and high interaction. This clustered view in NetworkX complements Tableau’s map view, offering a clear picture of county relationships and similarities.
For strong similarities (e.g., similarity > 0.7), we will use NetworkX to build clusters of highly similar counties that are not in the same state. The resulting graph shows users which counties form interconnected clusters.
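
To make the thresholding step concrete, here is a small sketch in plain Python. It keeps pairs with similarity above 0.7 whose counties are in different states, then groups them into connected components with a union-find structure. The pair scores and ids are made up, and treating a "cluster" as a connected component is an assumption. On the real data, building a NetworkX graph from the same filtered edges and calling `nx.connected_components` should give equivalent groups.

```python
# Hypothetical pair scores: (county_1, state_1, county_2, state_2, similarity).
pairs = [
    ("AK20", "AK", "TX1", "TX", 0.91),
    ("TX1", "TX", "GA5", "GA", 0.85),
    ("AK20", "AK", "AK90", "AK", 0.95),  # same state, so excluded
    ("NY7", "NY", "CA3", "CA", 0.40),    # below threshold, so excluded
    ("NY7", "NY", "FL9", "FL", 0.72),
]

THRESHOLD = 0.7

# Keep only strong, cross-state similarities as graph edges.
edges = [
    (c1, c2)
    for c1, s1, c2, s2, sim in pairs
    if sim > THRESHOLD and s1 != s2
]

# Union-find: each county starts as its own cluster root.
parent = {}

def find(node):
    """Return the root of the cluster that contains node."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node

def union(a, b):
    """Merge the clusters that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for a, b in edges:
    union(a, b)

# Group counties by their root to get the clusters.
clusters = {}
for node in list(parent):
    clusters.setdefault(find(node), set()).add(node)

cluster_list = sorted(sorted(members) for members in clusters.values())
print(cluster_list)
```

Counties that only appear in excluded pairs (same state or below the threshold) never enter the graph, so they do not show up in any cluster.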


Next steps:
Recalculate and refine the pricing index to improve the accuracy of the house price analysis.
Replace the housing and health inputs with the updated CSV files from the EDA step.

## Data Preparation

In [3]:
# import data
data = pd.read_csv('/Users/judithyemeli/Documents/CSE_6242/Project/MVR/Network_graph_analysis/full_dataset.csv')
print(len(data))
data.head()

3073


Unnamed: 0,State,FIPS,in_return,in_individuals,in_gross_income,out_return,out_individuals,out_gross_income,short_county_code,house_index,...,% Native Hawaiian/Other Pacific Islander,# Hispanic,% Hispanic,# Non-Hispanic White,% Non-Hispanic White,# Not Proficient in English,% Not Proficient in English,% Female,# Rural,% Rural
0,AK,20,579276,1167880,52632390,569940,1147243,51809926,20,98038.851467,...,4.225,64306.75,21.625,129787.0,44.05,5005.65,3005.575,37.8,37109.0,50.766667
1,AK,90,177038,365543,14181531,174474,360165,14011979,90,76671.822998,...,2.4,23650.5,23.75,52393.25,52.525,564.15,7531.725,42.225,13748.666667,76.966667
2,AK,100,4898,9011,363680,4898,9011,363680,100,53136.207111,...,0.75,541.0,21.375,1459.5,58.425,25.85,627.65,63.05,8846.0,36.066667
3,AK,110,64661,124678,5770237,64498,124406,5759018,110,168794.381271,...,2.5,6969.75,21.35,15719.25,48.8,282.475,1685.3,42.225,997.0,100.0
4,AK,122,110847,229224,8601897,111151,229854,8652362,122,91301.777446,...,1.275,13582.25,23.375,35376.25,60.45,215.325,10988.625,55.65,3463.333333,73.833333


In [4]:
# create a unique identifier for each county using both State and FIPS codes
data['id'] = data['State'] + data['FIPS'].astype(str)
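# drop the raw identifiers and the leftover CSV index column so they are not scaled as features
data_cleaned = data.drop(columns=['State', 'FIPS', 'Unnamed: 0'], errors='ignore')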

In [5]:
# scale data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_cleaned.drop('id', axis=1))
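
As a worked illustration of what the scaling step does to a single column, the sketch below standardizes a made-up list of values by hand. `StandardScaler` centers each column on its mean and divides by the population standard deviation (ddof = 0). The sample values and the helper name `standardize` are hypothetical.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]

# Made-up values for one numeric column.
column = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(column)
print(z_scores)  # this column has mean 5 and population std 2
```

After this transformation every column has mean 0 and unit variance, so features measured in large units (such as gross income) do not dominate features measured in percentages.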

In [6]:
# create pairwise combination of counties
def create_county_pairs(df, id_col='county_id'):
    """
    Creates a DataFrame of all possible pairs of counties with their respective data.

    Parameters:
    - df (pd.DataFrame): The original DataFrame with county data.
    - id_col (str): The column name for the county identifier (default is 'county_id').

    Returns:
    - pd.DataFrame: A DataFrame with every ordered pair of distinct counties (both A-B and B-A) and their data.
    """
    # Create all possible pairs of counties
    pairs = pd.DataFrame(list(product(df[id_col], df[id_col])), columns=[f'{id_col}_1', f'{id_col}_2'])

    # Filter out self-pairs (where both IDs are the same)
    pairs = pairs[pairs[f'{id_col}_1'] != pairs[f'{id_col}_2']].reset_index(drop=True)

    # Merge the pairs back to the original DataFrame to get the full data for each county
    pairs = pairs.merge(df.add_suffix('_1'), on=f'{id_col}_1')
    pairs = pairs.merge(df.add_suffix('_2'), on=f'{id_col}_2')
    
    return pairs

In [7]:
# StandardScaler returns a NumPy array, so rebuild a DataFrame and re-attach 'id' before pairing
scaled_df = pd.DataFrame(scaled_data, columns=data_cleaned.columns.drop('id')).assign(id=data_cleaned['id'].to_numpy())
paired_data = create_county_pairs(scaled_df, id_col='id')

In [None]:
len(paired_data)

9440256
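
The count above follows directly from the pairing scheme: `product` followed by the self-pair filter yields every ordered pair, so n counties give n(n - 1) rows. The sketch below checks that arithmetic for n = 3073 and shows, on a toy list, how `itertools.combinations` would produce each unordered pair only once. Halving the work this way is only valid if the similarity measure is symmetric, which is an assumption here; the three ids are hypothetical.

```python
import math
from itertools import combinations, product

n_counties = 3073  # row count printed when the dataset was loaded

ordered_pairs = math.perm(n_counties, 2)    # n * (n - 1)
unordered_pairs = math.comb(n_counties, 2)  # n * (n - 1) / 2
print(ordered_pairs, unordered_pairs)

# Toy illustration with three hypothetical county ids.
ids = ["AK20", "AK90", "AK100"]
ordered = [(a, b) for a, b in product(ids, ids) if a != b]
unordered = list(combinations(ids, 2))
print(len(ordered), len(unordered))
```

If the similarity measure is symmetric, generating unordered pairs would roughly halve both the size of the paired dataset and the computation on the cluster.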

In [None]:
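# reconstruct the dataframe: attach State and FIPS for both counties of each pair
# (assumes paired_data has the 'id_1' and 'id_2' columns produced by create_county_pairs)
county_keys = data[['id', 'State', 'FIPS']]
final_df = paired_data.merge(county_keys.add_suffix('_1'), on='id_1')
final_df = final_df.merge(county_keys.add_suffix('_2'), on='id_2')

final_df.head()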

In [None]:
final_df.to_csv('final_df.csv', index=False)