# Using Python for Research Homework: Week 4, Case Study 3

Homophily is a property of networks. It occurs when nodes that are neighbors in a network share a characteristic more often than nodes that are not neighbors. In this case study, we will investigate homophily of several characteristics of individuals connected in social networks in rural India.

### Exercise 1
In Exercise 1, we will calculate the chance homophily for an arbitrary characteristic. Homophily is the proportion of edges in the network whose constituent nodes share that characteristic.

How much homophily do we expect by chance? If characteristics are distributed completely at random, the probability that two nodes \(x\) and \(y\) both have characteristic \(a\) is the square of the marginal probability of \(a\).

The total probability that nodes \(x\) and \(y\) share a characteristic is therefore the sum of the squared marginal probabilities of all characteristics in the network.
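As a worked illustration, write \(p_a\) for the marginal probability of characteristic \(a\) (a symbol introduced here for convenience). The first line restates the sum described above. The second evaluates it for a hypothetical group of three people in which two share one characteristic and the third has a different one.

```latex
\[
P(x \text{ and } y \text{ share a characteristic}) = \sum_{a} p_a^{2}
\]
\[
\left(\tfrac{2}{3}\right)^{2} + \left(\tfrac{1}{3}\right)^{2}
  = \tfrac{4}{9} + \tfrac{1}{9}
  = \tfrac{5}{9} \approx 0.556
\]
```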

<strong>Instructions</strong>
<ul><li>Create a function <code>marginal_prob</code> that takes a dictionary <code>chars</code> with personal IDs as keys and characteristics as values; it should return a dictionary with characteristics as keys and their marginal probabilities (the frequency of occurrence of a characteristic divided by the sum of the frequencies of all characteristics) as values.</li>
</ul>

In [71]:
from collections import Counter
import numpy as np

favorite_colors = {
    "ankit":  "red",
    "xiaoyu": "blue",
    "mary":   "blue"
}

print(Counter(favorite_colors.values()))
print(sum(Counter(favorite_colors.values()).values()))

Counter({'blue': 2, 'red': 1})
3


In [57]:
def marginal_prob(chars):
    # Count how often each characteristic occurs among the values of `chars`.
    occurrences = Counter(chars.values())
    total = sum(occurrences.values())

    # Alternative (Python 3), for a Counter `a`:
    # total = sum(a.values(), 0.0)
    # a = {k: v / total for k, v in a.items()}
    return {char: freq / total for char, freq in occurrences.items()}

In [76]:
# Testing Function
marginal_prob(favorite_colors)

{'blue': 0.6666666666666666, 'red': 0.3333333333333333}

<ul><li>Create a function <code>chance_homophily(chars)</code> that takes a dictionary <code>chars</code> defined as above and computes the chance homophily (homophily due to chance alone) for that characteristic.</li>
</ul>

In [77]:
def chance_homophily(chars):
    # Sum the squared marginal probabilities of all characteristics.
    marginal_probs = marginal_prob(chars)
    chance = 0

    for value in marginal_probs.values():
        chance += np.power(value, 2)

    return chance

<ul><li>A sample of three people's favorite colors is given in <code>favorite_colors</code>. Use your function to compute the chance homophily in this group, and store it as <code>color_homophily</code>.</li>
<li>Print <code>color_homophily</code>.</li> </ul>

In [78]:
color_homophily = chance_homophily(favorite_colors)
print(color_homophily)

0.5555555555555556


In [80]:
# Alternative Implementation 
def marginal_prob(chars):
    frequencies = dict(Counter(chars.values()))
    sum_frequencies = sum(frequencies.values())
    return {char: freq / sum_frequencies for char, freq in frequencies.items()}
                
def chance_homophily(chars):
    marginal_probs = marginal_prob(chars)
    return np.sum(np.square(list(marginal_probs.values())))

color_homophily = chance_homophily(favorite_colors)
print(color_homophily)

0.5555555555555556
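The value 0.5556 can be sanity-checked by simulation. The sketch below is not part of the assignment. It draws two characteristics independently and at random from the group many times and reports how often they match. The names `chance_homophily_exact`, `chance_homophily_simulated`, `n_draws`, and `seed` are made up for this illustration.

```python
import random
from collections import Counter


def chance_homophily_exact(chars):
    """Sum of squared marginal probabilities of the values in `chars`."""
    counts = Counter(chars.values())
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())


def chance_homophily_simulated(chars, n_draws=100_000, seed=0):
    """Estimate chance homophily by drawing two values independently."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = list(chars.values())
    same = sum(rng.choice(values) == rng.choice(values) for _ in range(n_draws))
    return same / n_draws


sample = {"ankit": "red", "xiaoyu": "blue", "mary": "blue"}
exact = chance_homophily_exact(sample)
approx = chance_homophily_simulated(sample)
print(exact, approx)
```

The two draws are made with replacement, which matches the assumption behind squaring the marginal probability. The simulated estimate should approach the exact value as `n_draws` grows.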


### Exercise 2

In the remaining exercises, we will calculate the actual homophily in these villages and compare the obtained values to those obtained by chance. In this exercise, we subset the data into individual villages and store them.

#### Instructions 

- `individual_characteristics.dta` contains several characteristics for each individual in the dataset such as age, religion, and caste. Use the `pandas` library to read in and store these characteristics as a dataframe called `df`.
- Store separate datasets for individuals belonging to Villages 1 and 2 as `df1` and `df2`, respectively.
- Note that some attributes may be missing for some individuals. In this case study, we will ignore rows of data where some column information is missing.
- Use the head method to display the first few entries of `df1`.
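The subsetting and missing-value steps above can be sketched on a toy table. Every value in `toy` is invented for illustration and only the column names `village`, `pid`, and `caste` mirror the real dataset. Using `dropna` is one possible way to ignore incomplete rows, not necessarily what the course solution does.

```python
import pandas as pd

# Made-up rows; None marks a missing caste entry.
toy = pd.DataFrame({
    "village": [1, 1, 2, 2],
    "pid": [100201, 100202, 200101, 200102],
    "caste": ["OBC", None, "OBC", "GENERAL"],
})

# Boolean masks select the rows belonging to each village.
toy1 = toy[toy["village"] == 1]
toy2 = toy[toy["village"] == 2]

# Drop rows of Village 1 whose caste is missing.
toy1_complete = toy1.dropna(subset=["caste"])

print(len(toy1), len(toy1_complete))
```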

In [87]:
import pandas as pd

df  = pd.read_csv("https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@individual_characteristics.csv", 
                  low_memory=False, 
                  index_col=0)

df1 = df[df["village"] == 1]  # individuals in Village 1
df2 = df[df["village"] == 2]  # individuals in Village 2

df1.head(5)

Unnamed: 0,village,adjmatrix_key,pid,hhid,resp_id,resp_gend,resp_status,age,religion,caste,subcaste,mothertongue,speakother,kannada,tamil,telugu,hindi,urdu,english,otherlang,educ,villagenative,native_name,native_type,native_taluk,native_district,res_time_yrs,res_time_mths,movereason,movecontact,movecontact_res,movecontact_hhid,movecontact_pid,movecontact_name,workflag,work_freq,work_freq_type,occupation,privategovt,work_outside,work_outside_freq,shgparticipate,shg_no,savings,savings_no,electioncard,rationcard,rationcard_colour
0,1,5,100201,1002,1,1,Head of Household,38,HINDUISM,OBC,THIGALA,KANNADA,No,No,No,No,No,No,No,,2ND STANDARD,Yes,,,,,,,,,,,,,Yes,6.0,DAYS PER WEEK,BUSINESS,PRIVATE BUSINESS,Yes,0.0,No,,No,,Yes,Yes,GREEN
1,1,6,100202,1002,2,2,Spouse of Head of Household,27,HINDUISM,OBC,THIGALA,KANNADA,No,No,No,No,No,No,No,,2ND STANDARD,No,data has been removed for publication,VILLAGE,data has been removed for publication,BANGALORE,16.0,,MARRIAGE,,,,,,No,,,,,,,No,,No,,Yes,Yes,GREEN
2,1,23,100601,1006,1,1,Head of Household,29,HINDUISM,OBC,VOKKALIGA,KANNADA,No,No,No,No,No,No,No,,7TH STANDARD,Yes,,,,,,,,,,,,,Yes,8.0,HOURS PER DAY,AGRICULTURE LABOUR,OTHER LAND,No,,No,,No,,Yes,Yes,GREEN
3,1,24,100602,1006,2,2,Spouse of Head of Household,24,HINDUISM,OBC,VOKKALIGA,KANNADA,No,No,No,No,No,No,No,,S.S.L.C.,No,data has been removed for publication,VILLAGE,data has been removed for publication,BANGALORE,6.0,,MARRIAGE,,,,,,Yes,8.0,HOURS PER DAY,TAILOR,PRIVATE BUSINESS,No,,Yes,1.0,Yes,1.0,Yes,No,
4,1,27,100701,1007,1,1,Head of Household,58,HINDUISM,OBC,VOKKALIGA,KANNADA,No,No,No,No,No,No,No,,S.S.L.C.,Yes,,,,,,,,,,,,,Yes,6.0,DAYS PER WEEK,AGRICULTURE CAUSUAL LABOUR,OTHER LAND,No,,No,,No,,Yes,Yes,GREEN


In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16984 entries, 0 to 16983
Data columns (total 48 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   village            16984 non-null  int64  
 1   adjmatrix_key      16984 non-null  int64  
 2   pid                16984 non-null  int64  
 3   hhid               16984 non-null  int64  
 4   resp_id            16984 non-null  int64  
 5   resp_gend          16984 non-null  int64  
 6   resp_status        16984 non-null  object 
 7   age                16984 non-null  int64  
 8   religion           16983 non-null  object 
 9   caste              16951 non-null  object 
 10  subcaste           16984 non-null  object 
 11  mothertongue       16962 non-null  object 
 12  speakother         16984 non-null  object 
 13  kannada            16984 non-null  object 
 14  tamil              16984 non-null  object 
 15  telugu             16984 non-null  object 
 16  hindi              169

In [84]:
df.describe()

Unnamed: 0,village,adjmatrix_key,pid,hhid,resp_id,resp_gend,age,res_time_mths,movecontact_hhid,movecontact_pid,work_freq,savings_no
count,16984.0,16984.0,16984.0,16984.0,16984.0,16984.0,16984.0,3181.0,249.0,249.0,10691.0,6681.0
mean,40.947009,491.34244,4105229.0,41052.272138,2.2752,1.554404,38.994995,0.604915,34230.064257,2.453815,5.49088,0.565933
std,21.663948,334.639635,2167067.0,21670.671382,1.943071,0.497046,12.68943,2.578092,23479.874141,5.379066,1.633314,27.365636
min,1.0,1.0,100201.0,1002.0,1.0,1.0,10.0,0.0,1007.0,1.0,0.0,-999.0
25%,24.0,223.0,2407301.0,24073.0,1.0,1.0,30.0,0.0,9160.0,1.0,5.0,1.0
50%,42.0,446.5,4209004.0,42090.0,2.0,2.0,38.0,0.0,33081.0,1.0,6.0,1.0
75%,60.0,702.0,6002101.0,60021.0,2.0,2.0,48.0,0.0,56070.0,2.0,6.0,1.0
max,77.0,1703.0,7715502.0,77155.0,29.0,2.0,99.0,45.0,77153.0,55.0,49.0,8.0


### Exercise 3 

In this exercise, we define a few dictionaries that enable us to look up the sex, caste, and religion of members of each village by personal ID. For Villages 1 and 2, their personal IDs are stored as `pid`.

#### Instructions 
- Define dictionaries with personal IDs as keys and a given covariate for that individual as values. Complete this for the sex, caste, and religion covariates, for Villages 1 and 2.
- For Village 1, store these dictionaries into variables named `sex1`, `caste1`, and `religion1`.
- For Village 2, store these dictionaries into variables named `sex2`, `caste2`, and `religion2`.

In [124]:
df1.head()

Unnamed: 0,village,adjmatrix_key,pid,hhid,resp_id,resp_gend,resp_status,age,religion,caste,subcaste,mothertongue,speakother,kannada,tamil,telugu,hindi,urdu,english,otherlang,educ,villagenative,native_name,native_type,native_taluk,native_district,res_time_yrs,res_time_mths,movereason,movecontact,movecontact_res,movecontact_hhid,movecontact_pid,movecontact_name,workflag,work_freq,work_freq_type,occupation,privategovt,work_outside,work_outside_freq,shgparticipate,shg_no,savings,savings_no,electioncard,rationcard,rationcard_colour
0,1,5,100201,1002,1,1,Head of Household,38,HINDUISM,OBC,THIGALA,KANNADA,No,No,No,No,No,No,No,,2ND STANDARD,Yes,,,,,,,,,,,,,Yes,6.0,DAYS PER WEEK,BUSINESS,PRIVATE BUSINESS,Yes,0.0,No,,No,,Yes,Yes,GREEN
1,1,6,100202,1002,2,2,Spouse of Head of Household,27,HINDUISM,OBC,THIGALA,KANNADA,No,No,No,No,No,No,No,,2ND STANDARD,No,data has been removed for publication,VILLAGE,data has been removed for publication,BANGALORE,16.0,,MARRIAGE,,,,,,No,,,,,,,No,,No,,Yes,Yes,GREEN
2,1,23,100601,1006,1,1,Head of Household,29,HINDUISM,OBC,VOKKALIGA,KANNADA,No,No,No,No,No,No,No,,7TH STANDARD,Yes,,,,,,,,,,,,,Yes,8.0,HOURS PER DAY,AGRICULTURE LABOUR,OTHER LAND,No,,No,,No,,Yes,Yes,GREEN
3,1,24,100602,1006,2,2,Spouse of Head of Household,24,HINDUISM,OBC,VOKKALIGA,KANNADA,No,No,No,No,No,No,No,,S.S.L.C.,No,data has been removed for publication,VILLAGE,data has been removed for publication,BANGALORE,6.0,,MARRIAGE,,,,,,Yes,8.0,HOURS PER DAY,TAILOR,PRIVATE BUSINESS,No,,Yes,1.0,Yes,1.0,Yes,No,
4,1,27,100701,1007,1,1,Head of Household,58,HINDUISM,OBC,VOKKALIGA,KANNADA,No,No,No,No,No,No,No,,S.S.L.C.,Yes,,,,,,,,,,,,,Yes,6.0,DAYS PER WEEK,AGRICULTURE CAUSUAL LABOUR,OTHER LAND,No,,No,,No,,Yes,Yes,GREEN


In [134]:
sex1      = dict(zip(df1.pid.values, df1.resp_gend.values))
caste1    = dict(zip(df1.pid.values, df1.caste.values))
religion1 = dict(zip(df1.pid.values, df1.religion.values))

# Continue for df2 as well.
sex2      = dict(zip(df2.pid.values, df2.resp_gend.values))
caste2    = dict(zip(df2.pid.values, df2.caste.values))
religion2 = dict(zip(df2.pid.values, df2.religion.values))

In [135]:
caste2[202802]

'OBC'

In [139]:
# Alternative solution 
sex1 = df1.set_index("pid")["resp_gend"].to_dict()
caste1 = df1.set_index("pid")["caste"].to_dict()
religion1 = df1.set_index("pid")["religion"].to_dict()

sex2 = df2.set_index("pid")["resp_gend"].to_dict()
caste2 = df2.set_index("pid")["caste"].to_dict()
religion2 = df2.set_index("pid")["religion"].to_dict()

In [140]:
df1.set_index("pid").head(2)

pid,village,adjmatrix_key,hhid,resp_id,resp_gend,resp_status,age,religion,caste,subcaste,mothertongue,speakother,kannada,tamil,telugu,hindi,urdu,english,otherlang,educ,villagenative,native_name,native_type,native_taluk,native_district,res_time_yrs,res_time_mths,movereason,movecontact,movecontact_res,movecontact_hhid,movecontact_pid,movecontact_name,workflag,work_freq,work_freq_type,occupation,privategovt,work_outside,work_outside_freq,shgparticipate,shg_no,savings,savings_no,electioncard,rationcard,rationcard_colour
100201,1,5,1002,1,1,Head of Household,38,HINDUISM,OBC,THIGALA,KANNADA,No,No,No,No,No,No,No,,2ND STANDARD,Yes,,,,,,,,,,,,,Yes,6.0,DAYS PER WEEK,BUSINESS,PRIVATE BUSINESS,Yes,0.0,No,,No,,Yes,Yes,GREEN
100202,1,6,1002,2,2,Spouse of Head of Household,27,HINDUISM,OBC,THIGALA,KANNADA,No,No,No,No,No,No,No,,2ND STANDARD,No,data has been removed for publication,VILLAGE,data has been removed for publication,BANGALORE,16.0,,MARRIAGE,,,,,,No,,,,,,,No,,No,,Yes,Yes,GREEN


In [141]:
df1.set_index("pid").head(2)["resp_gend"]

pid
100201    1
100202    2
Name: resp_gend, dtype: int64

In [143]:
df1.set_index("pid").head(2)["resp_gend"].to_dict()

{100201: 1, 100202: 2}

### Exercise 4

In this exercise, we will print the chance homophily of several characteristics of Villages 1 and 2. 

#### Instructions 
- Use `chance_homophily` to compute the chance homophily for sex, caste, and religion in Villages 1 and 2. Is the chance homophily for any attribute very high for either village?

In [146]:
# Enter your code here.
print("Village 1 chance of same sex:", chance_homophily(sex1))
print("Village 1 chance of same caste:", chance_homophily(caste1))
print("Village 1 chance of same religion:", chance_homophily(religion1))

print("Village 2 chance of same sex:", chance_homophily(sex2))
print("Village 2 chance of same caste:", chance_homophily(caste2))
print("Village 2 chance of same religion:", chance_homophily(religion2))

Village 1 chance of same sex: 0.5027299861680701
Village 1 chance of same caste: 0.6741488509791551
Village 1 chance of same religion: 0.9804896988521925
Village 2 chance of same sex: 0.5005945303210464
Village 2 chance of same caste: 0.425368244800893
Village 2 chance of same religion: 1.0
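To help read these numbers, the sketch below evaluates the sum of squared probabilities for three hypothetical distributions: two balanced categories, one dominant category, and a single category. The probabilities are illustrative and not taken from the village data. The helper name `sum_of_squares` is made up for this example.

```python
def sum_of_squares(probabilities):
    """Chance homophily for a given list of marginal probabilities."""
    return sum(p ** 2 for p in probabilities)


balanced = sum_of_squares([0.5, 0.5])      # two equally common categories
dominated = sum_of_squares([0.99, 0.01])   # one category covers almost everyone
single = sum_of_squares([1.0])             # everyone shares one category

print(balanced, dominated, single)
```

A value near 0.5 for sex is what two roughly balanced categories would give, and a value near 1 for religion is what a single dominant category would give. A chance homophily of exactly 1.0 is only possible when a single value is present.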


### Exercise 5

In this exercise, we will create a function that computes the observed homophily given a village and characteristic.

#### Instructions 
- Complete the function `homophily()`, which takes a network `G`, a dictionary of node characteristics `chars`, and node IDs `IDs`. For each node pair, determine whether a tie exists between them, as well as whether they share a characteristic. These two counts are stored in `num_ties` and `num_same_ties`, respectively, and the ratio `num_same_ties / num_ties` is the homophily of `chars` in `G`. Complete the function by choosing where to increment `num_same_ties` and `num_ties`.

In [147]:
def homophily(G, chars, IDs):
    """
    Given a network G, a dict of characteristics chars for node IDs,
    and dict of node IDs for each node in the network,
    find the homophily of the network.
    """
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in G.edges():
        # Only consider ties whose two endpoints both have a known characteristic.
        if IDs[n1] in chars and IDs[n2] in chars:
            # Always true here, because the loop iterates over existing edges.
            if G.has_edge(n1, n2):
                # Every tie between two known nodes counts toward the total.
                num_ties += 1
                if chars[IDs[n1]] == chars[IDs[n2]]:
                    # The tie connects nodes that share the characteristic.
                    num_same_ties += 1
    return num_same_ties / num_ties
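The counting logic can be checked by hand on a tiny example. The sketch below is a simplified stand-in that takes a plain list of edges in place of a networkx graph, so it runs without extra libraries. The function name `homophily_from_edges` and all toy values are assumptions made for illustration.

```python
def homophily_from_edges(edges, chars, IDs):
    """Share of ties whose two endpoints have the same characteristic."""
    num_same_ties = 0
    num_ties = 0
    for n1, n2 in edges:
        # Skip ties where either endpoint has no recorded characteristic.
        if IDs[n1] in chars and IDs[n2] in chars:
            num_ties += 1
            if chars[IDs[n1]] == chars[IDs[n2]]:
                num_same_ties += 1
    return num_same_ties / num_ties


toy_edges = [(0, 1), (1, 2), (2, 3)]                # node indices
toy_ids = {0: 101, 1: 102, 2: 103, 3: 104}          # node index -> personal ID
toy_chars = {101: "red", 102: "red", 103: "blue"}   # ID 104 has no entry

toy_homophily = homophily_from_edges(toy_edges, toy_chars, toy_ids)
print(toy_homophily)
```

Of the three ties, the one involving ID 104 is skipped. Of the remaining two, one connects nodes with the same value, so the result is one half.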

### Exercise 6

In this exercise, we will obtain the personal IDs for Villages 1 and 2. These will be used in the next exercise to calculate homophily for these villages.

#### Instructions 
- In this dataset, each individual has a personal ID, or PID, stored in `key_vilno_1.csv` and `key_vilno_2.csv` for villages 1 and 2, respectively. `data_filepath1` and `data_filepath2` contain the URLs to the datasets used in this exercise. Use `pd.read_csv` to read in and store `key_vilno_1.csv` and `key_vilno_2.csv` as `pid1` and `pid2` respectively. 

In [150]:
data_filepath1 = "https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@key_vilno_1.csv"
data_filepath2 = "https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@key_vilno_2.csv"

# Enter code here!
pid1 = pd.read_csv(data_filepath1, index_col=0)
pid2 = pd.read_csv(data_filepath2, index_col=0)

In [155]:
pid1.iloc[100]

0    102205
Name: 100, dtype: int64
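A function such as `homophily()` looks up `IDs[n1]` for a node index `n1`. On a DataFrame that expression selects a column rather than a row, so a plain dictionary from node index to PID is more convenient. The sketch below shows the conversion on a made-up two-row table. The column name `"0"` follows the output shown above and the PID values are placeholders.

```python
import pandas as pd

# Hypothetical two-row stand-in for the contents of key_vilno_1.csv.
toy_pid = pd.DataFrame({"0": [100201, 100202]})

# Selecting the single column and calling to_dict() maps row position -> PID.
node_to_pid = toy_pid["0"].to_dict()

print(node_to_pid)
```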

### Exercise 7

In this exercise, we will compute the homophily of several network characteristics for Villages 1 and 2 and compare them to homophily due to chance alone. The networks for these villages have been stored as networkx graph objects `G1` and `G2`.

#### Instructions 

- Use your `homophily()` function to compute the observed homophily for sex, caste, and religion in Villages 1 and 2. Print all six values.
- Use `chance_homophily()` to compare these values to chance homophily. Are the observed values higher or lower than expected by chance?

In [156]:
import networkx as nx
A1 = np.array(pd.read_csv("https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@adj_allVillageRelationships_vilno1.csv", index_col=0))
A2 = np.array(pd.read_csv("https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@adj_allVillageRelationships_vilno2.csv", index_col=0))
G1 = nx.to_networkx_graph(A1)
G2 = nx.to_networkx_graph(A2)

pid1 = pd.read_csv(data_filepath1, dtype=int)['0'].to_dict()
pid2 = pd.read_csv(data_filepath2, dtype=int)['0'].to_dict()

# Enter your code here!
print(f"Observed homophily in Village 1 for Sex: {homophily(G1, sex1, pid1)}")
print(f"Observed homophily in Village 1 for Caste: {homophily(G1, caste1, pid1)}")
print(f"Observed homophily in Village 1 for Religion: {homophily(G1, religion1, pid1)}")

print(f"Observed homophily in Village 2 for Sex: {homophily(G2, sex2, pid2)}")
print(f"Observed homophily in Village 2 for Caste: {homophily(G2, caste2, pid2)}")
print(f"Observed homophily in Village 2 for Religion: {homophily(G2, religion2, pid2)}")

Observed homophily in Village 1 for Sex: 0.5908629441624366
Observed homophily in Village 1 for Caste: 0.7959390862944162
Observed homophily in Village 1 for Religion: 0.9908629441624366
Observed homophily in Village 2 for Sex: 0.5658073270013568
Observed homophily in Village 2 for Caste: 0.8276797829036635
Observed homophily in Village 2 for Religion: 1.0


In [157]:
print("Village 1 chance of same sex:", chance_homophily(sex1))
print("Village 1 chance of same caste:", chance_homophily(caste1))
print("Village 1 chance of same religion:", chance_homophily(religion1))

print("Village 2 chance of same sex:", chance_homophily(sex2))
print("Village 2 chance of same caste:", chance_homophily(caste2))
print("Village 2 chance of same religion:", chance_homophily(religion2))

Village 1 chance of same sex: 0.5027299861680701
Village 1 chance of same caste: 0.6741488509791551
Village 1 chance of same religion: 0.9804896988521925
Village 2 chance of same sex: 0.5005945303210464
Village 2 chance of same caste: 0.425368244800893
Village 2 chance of same religion: 1.0
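One way to make the comparison explicit is to subtract chance homophily from observed homophily for each attribute. The sketch below does this with the values printed above, rounded to four decimals. The dictionary layout and the name `excess` are choices made for this illustration.

```python
# Observed and chance homophily, copied from the printed output and rounded.
observed = {
    ("Village 1", "sex"): 0.5909,
    ("Village 1", "caste"): 0.7959,
    ("Village 1", "religion"): 0.9909,
    ("Village 2", "sex"): 0.5658,
    ("Village 2", "caste"): 0.8277,
    ("Village 2", "religion"): 1.0,
}
chance = {
    ("Village 1", "sex"): 0.5027,
    ("Village 1", "caste"): 0.6741,
    ("Village 1", "religion"): 0.9805,
    ("Village 2", "sex"): 0.5006,
    ("Village 2", "caste"): 0.4254,
    ("Village 2", "religion"): 1.0,
}

# Positive values mean more same-characteristic ties than chance predicts.
excess = {key: round(observed[key] - chance[key], 4) for key in observed}

for (village, attribute), value in excess.items():
    print(f"{village} {attribute}: observed minus chance = {value}")

largest = max(excess, key=excess.get)
print("Largest excess:", largest)
```

With these numbers, no attribute falls below its chance level. The gap is largest for caste in Village 2, and it is zero for religion in Village 2, where chance homophily is already 1.0.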
