# Homophily in Rural Villages in India


**Prerequisites**

- Introduction to Graphs
- Strong and Weak Ties
- Homophily

**Outcomes**

- Case study of homophily amongst residents of rural towns in India
- Hands-on practice working with graphs and DataFrames, and combining them

**References**

- [Easley and Kleinberg](https://www.cs.cornell.edu/home/kleinber/networks-book/) chapter 4
- [DataCamp exercises](https://campus.datacamp.com/courses/using-python-for-research/case-study-6-social-network-analysis?ex=1)  with this dataset


## Part 1: Counting and Frequencies

To prepare for studying homophily, we need to write a few Python functions that count the number of occurrences of each value of a characteristic and then compute the frequency of each value.

In [None]:
def counts(obs: list) -> dict:
    """
    Count the number of occurrences of each item in the list `obs`

    The return value should be a dict mapping from items in `obs` to
    the number of times each item occurs in `obs`

    Example:

    `counts([1, 2, 1, 1])` should return `{1: 3, 2: 1}`
    """
    out = dict()
    
    # YOUR CODE HERE
    raise NotImplementedError()
    # HINT: the function `out.get(item, 0)` might be helpful... see docs

    return out

In [None]:
assert counts([1, 1, 2, 1]) == {1: 3, 2: 1}
assert counts([1, 1, 2, 1, 3]) == {1: 3, 2: 1, 3: 1}
assert counts(["a", "world", "b", "b", "world"]) == {"a": 1, "b": 2, "world": 2}
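
As an optional cross-check on `counts`, the standard library's `collections.Counter` produces the same tallies. This is only a sketch for verifying your own answer, since the exercise asks you to build the dict yourself; `obs` and `expected` are illustrative names.

```python
from collections import Counter

obs = ["a", "world", "b", "b", "world"]

# Counter is a dict subclass that tallies hashable items
expected = dict(Counter(obs))

print(expected)
```

Comparing `counts(obs) == expected` on a few lists of your own is a quick way to catch off-by-one mistakes.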

Next, we will use our counting function to compute frequencies

The `frequencies` function below consumes a list, first computes counts with `counts`, and then divides each count by the total number of observations

In [None]:
def frequencies(obs: list) -> dict:
    """
    Given a list of observations, compute the frequency of each value in obs


    Example:

    `frequencies([1, 1, 2, 1])` should return `{1: 0.75, 2: 0.25}`
    
    Notes:
    
    Uses the `counts` function above

    """
    # YOUR CODE HERE
    raise NotImplementedError()
    

In [None]:
assert frequencies([1, 1, 2, 1]) == {1: 0.75, 2: 0.25}
assert frequencies([1, 1, 2, 1, 3]) == {1: 0.6, 2: 0.2, 3: 0.2}
assert frequencies(["a", "world", "b", "b", "world"]) == {"a": 0.2, "b": 0.4, "world": 0.4}
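
If you want an independent check of `frequencies`, pandas can compute the same shares directly. The sketch below assumes pandas is installed and is not the intended solution, which should build on `counts`; `shares` is an illustrative name.

```python
import pandas as pd

obs = [1, 1, 2, 1]

# normalize=True divides each count by the total number of observations
shares = pd.Series(obs).value_counts(normalize=True).to_dict()

print(shares)
```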

## Part 2: Chance Homophily


Let's now write a function that computes the degree of homophily we would expect if edges were formed entirely by chance and node characteristics were also assigned at random

Suppose there is a characteristic `X` with possible values `[a, b, c]`

Furthermore, suppose that the probability that an individual node has realization `a` is $p_a$. Similarly, $p_b$ and $p_c$ represent the probabilities of observing `b` and `c`, respectively

Now consider an edge between two random nodes. The probability that the edge is between two `a` type nodes is $p_a^2$. Likewise, the probabilities for `b` and `c` are $p_b^2$ and $p_c^2$

Therefore, the probability that a random edge will be formed between two individuals that share characteristic `X` is given by

$$\sum_{x \in \{a,b,c\}} p_x^2$$

In our running example of realizations `[1, 2, 1, 1]`, the probability of chance homophily is $0.75^2 + 0.25^2 = 0.625$
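
The formula can be checked by simulation. The sketch below draws the characteristic of each endpoint of many random edges independently, using the frequencies from the running example, and compares the share of like-type edges with the sum of squared probabilities. The seed, the number of draws, and all variable names are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

values = [1, 2]
probs = [0.75, 0.25]  # frequencies from the running example [1, 2, 1, 1]

n_edges = 100_000
same = 0
for _ in range(n_edges):
    # draw the characteristic of each endpoint independently
    u, v = rng.choices(values, weights=probs, k=2)
    if u == v:
        same += 1

simulated = same / n_edges
formula = sum(p ** 2 for p in probs)

print(f"simulated: {simulated:.3f}, formula: {formula:.3f}")
```

With this many draws the simulated share should land close to the value given by the formula, though it will not match exactly.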

Implement the function below, which computes chance homophily given a dict of frequencies of occurrence (each frequency is our estimate of the corresponding probability $p$ above)

In [None]:
def chance_homophily(d: dict) -> float:
    """
    Given a dict of observed characteristic frequencies, 
    compute the chance_homophily, which is probability of 
    a random edge forming between two nodes that share a 
    characteristic

    Example:

    `chance_homophily({1: 0.75, 2: 0.25})` returns `0.625`
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert abs(chance_homophily({1: 0.75, 2: 0.25}) - 0.625) < 1e-8
assert abs(chance_homophily({"a": 0.2, "b": 0.4, "world": 0.4}) -  0.36) < 1e-8
assert abs(chance_homophily({1: 0.6, 2: 0.2, 3: 0.2}) -  0.44) < 1e-8
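
The tests above compare against a tolerance rather than using `==` because floating-point arithmetic is not exact. A small illustration using only the standard library:

```python
import math

total = 0.1 + 0.2

# exact comparison fails: 0.1 and 0.2 have no exact binary representation
print(total == 0.3)              # False
# comparing within a tolerance succeeds
print(abs(total - 0.3) < 1e-8)   # True
print(math.isclose(total, 0.3))  # True
```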

## Part 3: Loading Data

Let's now load up our observed characteristics data

The pandas function `pd.read_csv` can download a CSV file from a URL and load it into a DataFrame

Complete the `load_data` function below so that it loads the dataset at `https://compsosci-resources.s3.amazonaws.com/graph-theory-lectures/data/india_village_individual_characteristics.csv` into a DataFrame

In [None]:
import pandas as pd 

def load_data():
    # YOUR CODE HERE
    raise NotImplementedError()
    pass

In [None]:
df = load_data()
assert df.shape == (16984, 48)

## Part 4: Loading Graphs

We will also need to load the network data into a Graph

The network is stored in two Graphs -- one for each village

The variable `v1_graph_url` contains the URL of the graph for village 1

`v2_graph_url` has the URL for village 2

Each of these files is a CSV containing an adjacency matrix

Item `[i, j]` is either 0 or 1, depending on whether an edge exists (`1`) or not (`0`) between nodes i and j

There is one row per node

Your task is to complete the `read_remote_graph` function below so that it successfully fetches the file from the web, parses it as a CSV, and converts it to a Graph

**WARNING** The CSV files at `v1_graph_url` and `v2_graph_url` do not have headers. You will have to modify the keyword arguments passed to `pd.read_csv` in order to handle this properly. See the help for `pd.read_csv` for more information
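
To see why the header matters, here is a small sketch that parses a made-up 3x3 adjacency matrix from an in-memory string rather than a URL. It assumes pandas and networkx are installed. `toy_csv` is illustrative data, not part of the village dataset, and converting with `nx.from_numpy_array` is one possible route, not necessarily the intended one.

```python
import io

import networkx as nx
import pandas as pd

# a made-up, headerless adjacency matrix for three nodes
toy_csv = "0,1,1\n1,0,0\n1,0,0\n"

# default behaviour: the first row is consumed as column names
with_header = pd.read_csv(io.StringIO(toy_csv))
# header=None: every row is treated as data
no_header = pd.read_csv(io.StringIO(toy_csv), header=None)

print(with_header.shape)  # (2, 3)
print(no_header.shape)    # (3, 3)

# build an undirected graph from the square 0/1 matrix
toy_graph = nx.from_numpy_array(no_header.to_numpy())
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())  # 3 2
```

Without `header=None`, one node silently disappears from the matrix, which is the kind of error the node-count assertions in the test cell are designed to catch.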

In [None]:
root_url = "https://compsosci-resources.s3.amazonaws.com/graph-theory-lectures/data"
v1_graph_url = root_url + "/adj_allVillageRelationships_vilno_1.csv"
v2_graph_url = root_url + "/adj_allVillageRelationships_vilno_2.csv"

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
g1 = read_remote_graph(v1_graph_url)
g2 = read_remote_graph(v2_graph_url)

assert len(g1.nodes) == 843
assert len(g2.nodes) == 877

assert len(g1.edges) == 3405
assert len(g2.edges) == 3063

## Part 5: Empirical (observed) Homophily

Let's now explore the degree to which G1 and G2 exhibit homophily in the variables `resp_gender`, `caste`, and `religion`.

To do this we'll first need a function that can compute the ratio of edges that form between like-characteristic nodes and the total number of edges

We'll implement that below
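
To make the ratio concrete, here is a hand-sized sketch on a plain edge list instead of an `nx.Graph`. The names `toy_edges` and `toy_traits` and their values are made up for illustration.

```python
# four nodes connected in a path: 0 - 1 - 2 - 3
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_traits = {0: "a", 1: "a", 2: "b", 3: "b"}

# e_xx: edges whose two endpoints share the same characteristic value
e_xx = sum(1 for u, v in toy_edges if toy_traits[u] == toy_traits[v])
# e: total number of edges
e = len(toy_edges)

print(e_xx / e)
```

Edges (0, 1) and (2, 3) join like nodes while (1, 2) does not, so the ratio is 2/3.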

In [None]:
import networkx as nx  # needed for the nx.Graph annotation below

def observed_homophily(
        G: nx.Graph,
        characteristics: dict
    ) -> float:
    """
    Given a network G and a dict mapping from node id to a characteristic,
    compute the statistic e_xx / e, where e_xx is the number of edges between
    two nodes that share the same value for the given characteristic and 
    e is the total number of edges
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In order to test this function, we need a way to extract the characteristic dict from our DataFrame. The `extract_characteristic_dict` function below does that for us

In [None]:
def extract_characteristic_dict(df: pd.DataFrame, village: int, characteristic: str) -> dict:
    """
    Given our DataFrame of observations, an integer for which village,
    and a string for a characteristic, return a dict mapping from
    the `adjmatrix_key` column to the value in the `characteristic`
    column for all residents of the village
    """
    village_df = df.loc[df["village"] == village, :]
    return village_df.set_index("adjmatrix_key")[characteristic].to_dict()

In [None]:
assert abs(
    observed_homophily(g1, extract_characteristic_dict(df, 1, "religion")) - 
    0.9907834101
) < 1e-5

assert abs(
    observed_homophily(g2, extract_characteristic_dict(df, 2, "caste")) - 
    0.8564231738
) < 1e-5

## Part 6: Diagnosing Homophily

Our final step is to compare the chance_homophily that would be observed with random edge formation to the actual homophily we can compute using our new tools.

Your task on this part is to do the following:

- For each of `resp_gender`, `caste`, and `religion` and both village 1 and 2...
- Compute the chance homophily we would expect for that (village, variable) combination under random edge formation
- Then compute the observed homophily for the (village, variable) combination
- Finally, make a statement about whether or not your results provide evidence of a greater-than-random degree of homophily. Refer to the lecture on homophily to review how to make this decision.

Include any other analysis or results you feel would strengthen your argument.
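
One way to organize the comparison is to loop over every (village, variable) pair and collect one row per pair. The sketch below shows only that bookkeeping. `summarize` and its `compute` argument are hypothetical names, and the stand-in `fake_compute` returns placeholder numbers, not results from the data.

```python
from itertools import product

villages = [1, 2]
variables = ["resp_gender", "caste", "religion"]


def summarize(compute):
    """Collect one row per (village, variable) pair.

    `compute(village, variable)` is assumed to return a
    (chance, observed) tuple of floats.
    """
    rows = []
    for village, variable in product(villages, variables):
        chance, observed = compute(village, variable)
        rows.append({
            "village": village,
            "variable": variable,
            "chance": chance,
            "observed": observed,
            # positive gap: more same-type edges than chance predicts
            "gap": observed - chance,
        })
    return rows


def fake_compute(village, variable):
    # placeholder values only, NOT results from the dataset
    return 0.5, 0.75


rows = summarize(fake_compute)
print(len(rows))  # 6
```

In the assignment, `compute` would wrap `extract_characteristic_dict`, `frequencies`, `chance_homophily`, and `observed_homophily`. Passing the resulting rows to `pd.DataFrame` gives a table that is easy to read.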

In [None]:
# YOUR CODE HERE
raise NotImplementedError()