In [20]:
from collections import Counter
import networkx as nx
import numpy as np

Network homophily occurs when nodes that share an edge share a characteristic more often than nodes that do not share an edge. In this case study, we will investigate homophily of several characteristics of individuals connected in social networks in rural India.

# Exercise 1

**Instructions**

- ```individual_characteristics.dta``` contains several characteristics for each individual in the dataset such as age, religion, and caste. Use the pandas library to read in and store these characteristics as a dataframe called ```df```.
- Store separate datasets for individuals belonging to Villages 1 and 2 as ```df1``` and ```df2```, respectively. (Note that some attributes may be missing for some individuals. Here, investigate only those pairs of nodes where the attributes are known for both nodes. This means that we're effectively assuming that the data are missing completely at random.)
- Use the ```head``` method to display the first few entries of ```df1```.


In [1]:
import pandas as pd
df = pd.read_stata("https://s3.amazonaws.com/assets.datacamp.com/production/course_974/datasets/individual_characteristics.dta")
# Enter code here!
df1 = df[(df['village']==1)]
df2 = df[(df['village']==2)]
df1.head()

Unnamed: 0,village,adjmatrix_key,pid,hhid,resp_id,resp_gend,resp_status,age,religion,caste,...,privategovt,work_outside,work_outside_freq,shgparticipate,shg_no,savings,savings_no,electioncard,rationcard,rationcard_colour
0,1,5,100201,1002,1,1,Head of Household,38,HINDUISM,OBC,...,PRIVATE BUSINESS,Yes,0.0,No,,No,,Yes,Yes,GREEN
1,1,6,100202,1002,2,2,Spouse of Head of Household,27,HINDUISM,OBC,...,,,,No,,No,,Yes,Yes,GREEN
2,1,23,100601,1006,1,1,Head of Household,29,HINDUISM,OBC,...,OTHER LAND,No,,No,,No,,Yes,Yes,GREEN
3,1,24,100602,1006,2,2,Spouse of Head of Household,24,HINDUISM,OBC,...,PRIVATE BUSINESS,No,,Yes,1.0,Yes,1.0,Yes,No,
4,1,27,100701,1007,1,1,Head of Household,58,HINDUISM,OBC,...,OTHER LAND,No,,No,,No,,Yes,Yes,GREEN
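
The village split and the note about missing attributes can be checked on a small example. The sketch below uses a made-up `toy` dataframe (not the survey data) and assumes missing attributes appear as `NaN`; in the real file they may be encoded differently, for example as empty strings.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; all values are invented for illustration.
toy = pd.DataFrame({
    "village":  [1, 1, 2, 2],
    "pid":      [100201, 100202, 200101, 200102],
    "religion": ["HINDUISM", np.nan, "HINDUISM", "ISLAM"],
})

# Boolean-mask selection, the same pattern used for df1 and df2.
toy1 = toy[toy["village"] == 1]
toy2 = toy[toy["village"] == 2]

# Count missing entries per column in the Village 1 subset.
missing_per_column = toy1.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` check on `df1` and `df2` would show which covariates have gaps before any dictionaries are built from them.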


# Exercise 2

**Instructions**

- In this dataset, each individual has a personal ID, or PID, stored in ```key_vilno_1.csv``` and ```key_vilno_2.csv``` for villages 1 and 2, respectively. ```data_filepath``` contains the base URL to the datasets used in this exercise. Use ```pd.read_csv``` to read in and store ```key_vilno_1.csv``` and ```key_vilno_2.csv``` as pid1 and pid2 respectively. The csv files have no headers, so make sure to include the parameter ```header = None```.

In [2]:
pid1  = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_974/datasets/key_vilno_1.csv', delimiter = ',', header = None)
pid2  = pd.read_csv('https://s3.amazonaws.com/assets.datacamp.com/production/course_974/datasets/key_vilno_2.csv', delimiter = ',', header = None)

In [26]:
pid1.tail()

Unnamed: 0,0
838,118202
839,118301
840,118302
841,118303
842,118304
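
Reading a headerless file gives a dataframe with a single column labelled `0`. The `homophily` function in Exercise 6 looks up `IDs[n1]` by node label, so a Series or dict is likely easier to work with than the dataframe itself. A minimal sketch, assuming node labels correspond to row positions in the key file (the toy values are placeholders):

```python
import pandas as pd

# Hypothetical stand-in for a headerless key file: one PID per row.
toy_pid = pd.DataFrame([100201, 100202, 100601])

# Selecting column 0 gives a Series indexed by row position.
toy_ids = toy_pid[0]
print(toy_ids[2])  # PID stored in row 2

# A plain dict works equally well for position -> PID lookups.
toy_id_lookup = toy_ids.to_dict()
```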


# Exercise 3

**Instructions**

- Define Python dictionaries with personal IDs as keys and a given covariate for that individual as values. Complete this for the sex, caste, and religion covariates, for Villages 1 and 2. Store these into variables named sex1, caste1, and religion1 for Village 1 and sex2, caste2, and religion2 for Village 2.


In [7]:
sex1      = dict(zip(df1.pid, df1.resp_gend))
caste1    = dict(zip(df1.pid, df1.caste))
religion1 = dict(zip(df1.pid, df1.religion))

# Continue for df2 as well.
sex2      = dict(zip(df2.pid, df2.resp_gend))
caste2    = dict(zip(df2.pid, df2.caste))
religion2 = dict(zip(df2.pid, df2.religion))
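
Because `dict(zip(...))` keeps every row, individuals with a missing attribute still end up in the dictionary. One possible way to keep only known values is sketched below; `covariate_dict` is a hypothetical helper, and the sketch assumes missing values are stored as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data with one missing caste; values are invented for illustration.
toy = pd.DataFrame({
    "pid":   [100201, 100202, 100601],
    "caste": ["OBC", np.nan, "GENERAL"],
})

def covariate_dict(frame, column):
    """Map pid -> value of `column`, skipping rows where the value is missing."""
    known = frame.dropna(subset=[column])
    return dict(zip(known["pid"], known[column]))

toy_caste = covariate_dict(toy, "caste")
print(toy_caste)
```

With such a dictionary, a membership test like `pid in toy_caste` is true only for individuals whose attribute is known.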

In [8]:
sex1

{100201: 1,
 100202: 2,
 100601: 1,
 100602: 2,
 100701: 1,
 100702: 2,
 100801: 1,
 100802: 2,
 100805: 2,
 100806: 1,
 100807: 1,
 100808: 2,
 101301: 2,
 101302: 2,
 101303: 2,
 101601: 1,
 101602: 2,
 102001: 1,
 102002: 2,
 102004: 2,
 102101: 1,
 102102: 2,
 102103: 1,
 102104: 2,
 102401: 1,
 102402: 2,
 102901: 1,
 102902: 2,
 103101: 1,
 103102: 2,
 103104: 2,
 103105: 2,
 103201: 1,
 103202: 2,
 103301: 1,
 103302: 2,
 103501: 1,
 103502: 2,
 103701: 1,
 103702: 2,
 104001: 1,
 104002: 2,
 104101: 1,
 104102: 2,
 104201: 1,
 104202: 2,
 104301: 1,
 104302: 2,
 104801: 1,
 104802: 2,
 104901: 1,
 104902: 2,
 105201: 1,
 105301: 1,
 105302: 2,
 105303: 1,
 105304: 2,
 105401: 1,
 105402: 2,
 105404: 2,
 105802: 2,
 105901: 1,
 105902: 2,
 106201: 1,
 106202: 2,
 106204: 1,
 106205: 2,
 106501: 1,
 106502: 1,
 106503: 2,
 106701: 1,
 106702: 2,
 106704: 2,
 106801: 1,
 106802: 2,
 107101: 1,
 107102: 2,
 107301: 1,
 107302: 2,
 107303: 1,
 107304: 2,
 107307: 1,
 107308: 2,
 107

# Exercise 4

**Instructions**

- Let's consider how much homophily exists in these networks. For a given characteristic, our measure of homophily will be the proportion of edges in the network whose constituent nodes share that characteristic. How much homophily do we expect by chance? If characteristics are distributed completely at random, the probability that two nodes x and y both have characteristic a is the frequency of a squared. The total probability that nodes x and y share a characteristic is therefore the sum of the squared frequencies of all characteristics in the network. For example, in the dictionary ```favorite_colors``` provided, the frequencies of red and blue are 1/3 and 2/3 respectively, so the chance homophily is (1/3)^2+(2/3)^2 = 5/9. Create a function ```chance_homophily(chars)``` that takes a dictionary with personal IDs as keys and characteristics as values, and computes the chance homophily for that characteristic.
- A sample of three peoples' favorite colors is given in ```favorite_colors```. Use your function to compute the chance homophily in this group, and store as ```color_homophily```.
- Print ```color_homophily```.
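
Written as a formula, with $p_a$ denoting the fraction of individuals whose characteristic takes the value $a$ (notation introduced here only for illustration), the chance homophily and the favorite-colors example are:

$$
h_{\text{chance}} = \sum_{a} p_a^{2},
\qquad
\left(\tfrac{1}{3}\right)^{2} + \left(\tfrac{2}{3}\right)^{2} = \tfrac{1}{9} + \tfrac{4}{9} = \tfrac{5}{9}.
$$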

In [10]:

def chance_homophily(chars):
    # Enter code here!
    chars_counts_dict = Counter(chars.values())
    chars_counts = np.array(list(chars_counts_dict.values()))
    chars_props  = chars_counts / sum(chars_counts)
    return sum(chars_props**2)

favorite_colors = {
    "ankit":  "red",
    "xiaoyu": "blue",
    "mary":   "blue"
}

color_homophily = chance_homophily(favorite_colors)
print(color_homophily)

0.555555555556
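
As a cross-check on the floating-point result, the same sum can be computed with exact fractions. This is only an illustrative variant; `chance_homophily_exact` is a hypothetical name, not part of the exercise.

```python
from collections import Counter
from fractions import Fraction

def chance_homophily_exact(chars):
    """Sum of squared category frequencies, kept as an exact fraction."""
    counts = Counter(chars.values())
    total = sum(counts.values())
    return sum(Fraction(c, total) ** 2 for c in counts.values())

favorite_colors = {"ankit": "red", "xiaoyu": "blue", "mary": "blue"}
print(chance_homophily_exact(favorite_colors))  # 5/9
```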


# Exercise 5

**Instructions**

- ```sex1```, ```caste1```, ```religion1```, ```sex2```, ```caste2```, and ```religion2``` are already defined from previous exercises. Use ```chance_homophily``` to compute the chance homophily for sex, caste, and religion in Villages 1 and 2. Is the chance homophily for any attribute very high for either village?

In [12]:
# Enter your code here.
# Village 1
print("Village 1 chance of same sex:", chance_homophily(sex1))
print("Village 1 chance of same caste:", chance_homophily(caste1))
print("Village 1 chance of same religion:", chance_homophily(religion1))
# Village 2
print("Village 2 chance of same sex:", chance_homophily(sex2))
print("Village 2 chance of same caste:", chance_homophily(caste2))
print("Village 2 chance of same religion:", chance_homophily(religion2))

Village 1 chance of same sex: 0.502729986168
Village 1 chance of same caste: 0.674148850979
Village 1 chance of same religion: 0.980489698852
Village 2 chance of same sex: 0.500594530321
Village 2 chance of same caste: 0.425368244801
Village 2 chance of same religion: 1.0


# Exercise 6

**Instructions**

- Now let's compute the observed homophily in our network. Recall that our measure of homophily is the proportion of edges whose nodes share a characteristic. ```homophily(G, chars, IDs)``` takes a network ```G```, a dictionary of characteristics chars, and node IDs IDs. For each node pair, determine whether a tie exists between them, as well as whether they share a characteristic. The total count of these is ```num_same_ties``` and ```num_ties``` respectively, and their ratio is the homophily of chars in G. Complete the function by choosing where to increment ```num_same_ties``` and ```num_ties```.

In [15]:
def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars keyed by personal ID,
    and a mapping IDs from each node in the network to its personal ID,
    find the homophily of the network.
    """
    num_same_ties, num_ties = 0, 0
    for n1 in G.nodes():
        for n2 in G.nodes():
            if n1 > n2:   # do not double-count edges!
                # Only consider pairs whose characteristic is known for both nodes.
                if IDs[n1] in chars and IDs[n2] in chars:
                    if G.has_edge(n1, n2):
                        # Every existing tie counts towards the denominator.
                        num_ties += 1
                        if chars[IDs[n1]] == chars[IDs[n2]]:
                            # Ties between nodes with the same characteristic
                            # count towards the numerator.
                            num_same_ties += 1
    return (num_same_ties / num_ties)
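
A small hand-checkable network is a quick way to confirm that the counters behave as intended. The sketch below restates the function compactly, with each counter incremented once and never reset, and applies it to an invented four-node graph; `toy_G`, `toy_IDs`, and `toy_colors` are illustrative names only.

```python
import networkx as nx

def homophily(G, chars, IDs):
    """Proportion of ties whose two endpoints share a characteristic."""
    num_same_ties, num_ties = 0, 0
    for n1 in G.nodes():
        for n2 in G.nodes():
            # n1 > n2 visits each unordered pair once.
            if n1 > n2 and IDs[n1] in chars and IDs[n2] in chars:
                if G.has_edge(n1, n2):
                    num_ties += 1
                    if chars[IDs[n1]] == chars[IDs[n2]]:
                        num_same_ties += 1
    return num_same_ties / num_ties

# Four nodes in a cycle; node labels map to made-up personal IDs.
toy_G = nx.Graph([(0, 1), (1, 2), (2, 3), (0, 3)])
toy_IDs = {0: "a", 1: "b", 2: "c", 3: "d"}
toy_colors = {"a": "red", "b": "red", "c": "blue", "d": "blue"}

toy_h = homophily(toy_G, toy_colors, toy_IDs)
print(toy_h)  # 0.5: edges (0,1) and (2,3) match, (1,2) and (0,3) do not
```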

# Exercise 7

**Instructions**

- The networks for Villages 1 and 2 have been stored as networkx graph objects ```G1``` and ```G2```. Use your ```homophily``` function to compute the observed homophily for sex, caste, and religion in Villages 1 and 2.
- Print all six values. Are these values higher or lower than that expected by chance?

In [None]:
print("Village 1 observed proportion of same sex:", homophily(G1, sex1, pid1))
print("Village 1 observed proportion of same caste:", homophily(G1, caste1, pid1))
print("Village 1 observed proportion of same religion:", homophily(G1, religion1, pid1))

print("Village 2 observed proportion of same sex:", homophily(G2, sex2, pid2))
print("Village 2 observed proportion of same caste:", homophily(G2, caste2, pid2))
print("Village 2 observed proportion of same religion:", homophily(G2, religion2 , pid2))


    Village 1 observed proportion of same sex: 0.5879345603271984
    Village 1 observed proportion of same caste: 0.7944785276073619
    Village 1 observed proportion of same religion: 0.99079754601227
    Village 2 observed proportion of same sex: 0.5622435020519836
    Village 2 observed proportion of same caste: 0.826265389876881
    Village 2 observed proportion of same religion: 1.0
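
To answer the question of whether observed homophily exceeds chance, the printed values from Exercises 5 and 7 can be compared directly. The sketch below simply copies those printed numbers into two dictionaries (the names `chance` and `observed` are illustrative) and prints the differences.

```python
# Values copied from the printed outputs of Exercises 5 and 7.
chance = {
    ("Village 1", "sex"):      0.502729986168,
    ("Village 1", "caste"):    0.674148850979,
    ("Village 1", "religion"): 0.980489698852,
    ("Village 2", "sex"):      0.500594530321,
    ("Village 2", "caste"):    0.425368244801,
    ("Village 2", "religion"): 1.0,
}
observed = {
    ("Village 1", "sex"):      0.5879345603271984,
    ("Village 1", "caste"):    0.7944785276073619,
    ("Village 1", "religion"): 0.99079754601227,
    ("Village 2", "sex"):      0.5622435020519836,
    ("Village 2", "caste"):    0.826265389876881,
    ("Village 2", "religion"): 1.0,
}

for village, attribute in chance:
    diff = observed[(village, attribute)] - chance[(village, attribute)]
    print(f"{village} {attribute}: observed - chance = {diff:+.3f}")
```

In every case the observed value is at least as large as the chance value, with the largest gap for caste in Village 2.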