### 01 Series Peer 2 Peer

Since we're working with tabular data, we'll be using the pandas library. This nifty tool will turn our dataset into a dataframe, which is just a fancy way of saying a table with rows and columns. Now, we won't be using the term 'table' much in this walkthrough. So, whenever you hear 'dataframe', just picture those neat rows and columns we're talking about.

To start utilizing a library in Python, you simply import it with the 'import' statement and then name the library. For convenience, 'pd' is often used as a shorthand alias for 'pandas', so we don't have to keep typing 'pandas' repeatedly. Let's go ahead and import it. While we won't go through every single pandas method here, you can always visit the official site at https://pandas.pydata.org/docs/user_guide/index.html to explore all the methods pandas offers.

In [1]:
import pandas as pd

### Variables

Python, like any other programming language, allows you to use variables to store information. This means that if you ever need to use this information again, you can simply recall the variable name.

### Examples

Let's assign values to two variables, `a` and `b`. This means that we are storing 5 in `a` and 7 in `b`.

In [2]:
a = 5
b = 7

In [3]:
a, b

(5, 7)

Evaluating `a, b` shows that `a` holds 5 whereas `b` holds 7. Notice that in a notebook you get output without even using the `print` function, because the value of the last expression in a cell is displayed automatically. You can still use `print` if you want, and you will see the same values. We are going to do the same with our dataframe and store it in a variable named `df`.

In [39]:
df = pd.read_excel("Household datasets for SEs.xlsx")

# Exploring Data with Pandas
Most of the time, when we read a file we want to know what our data is all about: for example, how many variables it has, the number of observations, and so on. Pandas provides very handy tools for this.

- `df.shape`, which returns the number of rows and columns of your table

- `df.head()`, which returns the first 5 rows of the dataframe by default

- `df.tail()`, which returns the last rows instead; `df.tail(2)` in the cell below shows the last 2

In [40]:
print(df.shape)
df.tail(2)

(92684, 14)


Unnamed: 0,district,subcounty,Parish,Cluster,village,Cohort,Cycle,Type,id,hhh_name,Contact,OX_ID,gender,age
92682,Kyenjojo,Nyantungo,Mabaale,Mabaale,Nyarugongo,2024,A,,KYE-NYA-AKU-M-130701,Akugizibwe Arikanjeru,772752000,KYE-NYA-AKU-M-130701,Male,
92683,Kyenjojo,Nyantungo,Mabaale,Mabaale,Nyarugongo,2024,A,,KYE-NYA-BRI-M-135651,Brian Twesige,777078648,KYE-NYA-BRI-M-135651,Male,
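
As a quick illustration on a tiny made-up dataframe (the `toy` data below is hypothetical and is not the household dataset), `shape`, `head` and `tail` behave roughly like this:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "village": ["A", "B", "C", "D", "E", "F", "G"],
    "age": [21, 34, 45, 52, 29, 60, 38],
})

rows, cols = toy.shape      # shape is an attribute, so no parentheses
first_five = toy.head()     # first 5 rows by default
last_two = toy.tail(2)      # last 2 rows

print(rows, cols)
print(last_two)
```

Note that `shape` is written without parentheses while `head()` and `tail()` are called like functions.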


You can use the `columns` attribute, which lists all the variable names, and call its `to_list()` method to wrap them in a list. A list is simply a data structure that stores a sequence of items. In Python it's written with square brackets `[]`. There are other data structures, including dictionaries, arrays and so on. We won't cover all of them, but it's nice to be aware of their existence. A very nice book I recommend on Python basics is `Think Python`, which you can find here: https://greenteapress.com/wp/think-python-2e/. It covers the data structures used in Python in detail. It's possible to do analysis without knowing Python basics, but knowing them will make you a better and more efficient programmer. For now, `Python for Data Analysis` is the best choice I can recommend for analysis: https://wesmckinney.com/book/data-analysis-examples.html.

In [41]:
df.columns.to_list()

['district',
 'subcounty',
 'Parish',
 'Cluster',
 'village',
 'Cohort',
 'Cycle',
 'Type',
 'id',
 'hhh_name',
 'Contact',
 'OX_ID',
 'gender',
 'age']

For our randomization we don't need all the columns, so we can pick just the ones we want by defining them in a list. Let's assign the list to a variable. For now, we can name it `columns`.

In [7]:
# columns = [
#  'district',
#  'subcounty',
#  'parish_t',
#  'village',
#  'hhh_full_name',
#  'Household_Head_Age',
#  'Household_Head_Contact',
#  'Household_Head_Gender',
#  'hhid',
#  ]

In [42]:
columns = [
 'district',
 'subcounty',
 'Parish',
 'village',
 'hhh_name',
 'age',
 'Contact',
 'gender',
 'id',
 ]

When you index your dataframe with a list of column names, it's like giving it a new shape: you get back only the columns you asked for. It's a bit like magic! Just remember, when you reassign `df`, you're basically telling the old one to take a hike, because there's a new dataframe in town.

In [43]:
df = df[columns]
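
One thing to watch for: column names are case-sensitive, and asking for a name that does not exist raises a `KeyError`. A small sketch on hypothetical toy data (the `toy` frame and the `wanted` list are made up for illustration) shows one way to check the names first:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "district": ["Kagadi", "Luuka"],
    "village": ["Kasozi_Kitooro", "Ikumbya"],
    "Cohort": [2024, 2024],
})

wanted = ["district", "village", "parish_t"]   # 'parish_t' is not a column of toy

# Split the wanted names into those that exist and those that do not
missing = [name for name in wanted if name not in toy.columns]
present = [name for name in wanted if name in toy.columns]

subset = toy[present].copy()   # .copy() makes the subset independent of toy
print("missing:", missing)
print("kept:", subset.columns.to_list())
```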

Let's first look at the first few rows to confirm that only the columns we picked remain. After that, we'll use another interesting pandas method, `value_counts`, which gives you the counts of each category in a particular column. It's super handy when you're trying to get a quick overview of your data and see which categories are the most common. One line of code, and you've got yourself a neat little summary of your dataset!

In [44]:
df.head()

Unnamed: 0,district,subcounty,Parish,village,hhh_name,age,Contact,gender,id
0,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Akakwasa More,60.0,785534054,Female,KAG-KAS-AKA-F-075731
1,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Akoragye Peter,30.0,771069272,Male,KAG-KAS-AKO-M-095616
2,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Aliganyira Nikorasi,26.0,784875889,Male,KAG-KAS-ALI-M-094647
3,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Anet Kokundeka,36.0,755236985,Female,KAG-KAS-ANE-F-093637
4,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Ayinemani Nicorous,21.0,789236760,Male,KAG-KAS-AYI-M-114920


In [12]:
df['district'].value_counts()

district
Rubirizi       12666
Kiryandongo    12554
Kibaale        10344
Kagadi          9586
Rubanda         7844
Mitooma         7261
Kyenjojo        6851
Rukungiri       5869
Kanungu         5486
Luuka           5291
Rukiga          2832
Kaliro          2738
Buhweju         1994
Kitagwenda      1368
Name: count, dtype: int64

In fact, when you're crunching numbers and diving into data analysis, tweaking a few parameters can make a world of difference. For instance, setting normalize=True is like flipping a switch that transforms raw numbers into proportions. It's a nifty trick that comes in handy almost daily, especially when you're piecing together reports and need to present your findings in a way that's easy to digest.

In [45]:
df['district'].value_counts(normalize=True) 

district
Rubirizi       0.136658
Kiryandongo    0.135449
Kibaale        0.111605
Kagadi         0.103427
Rubanda        0.084632
Mitooma        0.078341
Kyenjojo       0.073918
Rukungiri      0.063323
Kanungu        0.059190
Luuka          0.057086
Rukiga         0.030555
Kaliro         0.029541
Buhweju        0.021514
Kitagwenda     0.014760
Name: proportion, dtype: float64
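
For reports, proportions are often easier to read as percentages next to the raw counts. A minimal sketch on a made-up series (the district values below are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
districts = pd.Series(["Kagadi", "Kagadi", "Luuka", "Kaliro"], name="district")

counts = districts.value_counts()                 # raw counts per category
shares = districts.value_counts(normalize=True)   # proportions that sum to 1
percent = (shares * 100).round(1)                 # proportions as percentages

# Put counts and percentages side by side
summary = pd.DataFrame({"count": counts, "percent": percent})
print(summary)
```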

### Duplicates

Since we are doing randomization, we don't want any duplicated households. Thankfully, pandas has support for this. We can use the `drop_duplicates` method and pass in the column where we want to get rid of duplicates.

A duplicate means a value appears more than once. With `keep='first'` pandas retains the first occurrence and drops the rest, whereas `keep='last'` retains the last occurrence.

In [46]:
df = df.drop_duplicates('id', keep='first')
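
To see the difference between `keep='first'` and `keep='last'`, here is a small sketch on hypothetical data in which the id `A1` appears twice (the ids and ages are made up):

```python
import pandas as pd

# Hypothetical toy data: id 'A1' is recorded twice with different ages
toy = pd.DataFrame({
    "id": ["A1", "A2", "A1", "A3"],
    "age": [30, 41, 35, 52],
})

n_dupes = int(toy.duplicated("id").sum())        # how many rows repeat an earlier id

first = toy.drop_duplicates("id", keep="first")  # keeps the A1 row with age 30
last = toy.drop_duplicates("id", keep="last")    # keeps the A1 row with age 35

print(n_dupes)
print(first)
print(last)
```

Counting the duplicates with `duplicated` before dropping them is a cheap way to know how many rows you are about to lose.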

### Filtering

Filtering is a common task in data manipulation, especially when you need to focus on a specific subset of data. For instance, if you're only interested in analyzing data from Kiryandongo district, you can easily filter out the rest. Here's how you do it: compare the `district` column to `'Kiryandongo'` with the equality operator (`==`) and use the result to select the matching rows. The beauty of it is that you can assign this filtered data to a new variable, keeping your original dataframe intact for further use.

In [17]:
Kiryandongo = df[df['district'] == 'Kiryandongo']

In [13]:
Kiryandongo.shape

(11905, 9)
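
The same idea extends to several conditions or several districts at once. A short sketch on hypothetical toy data (the rows below are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "district": ["Kiryandongo", "Kagadi", "Kiryandongo", "Luuka"],
    "age": [25, 40, 52, 33],
})

# One condition: the comparison produces a True/False value per row
kiryandongo = toy[toy["district"] == "Kiryandongo"]

# Two conditions: wrap each in parentheses and combine with & (and) or | (or)
kiryandongo_over_30 = toy[(toy["district"] == "Kiryandongo") & (toy["age"] > 30)]

# Several allowed values: isin takes a list
two_districts = toy[toy["district"].isin(["Kagadi", "Luuka"])]

print(len(kiryandongo), len(kiryandongo_over_30), len(two_districts))
```

The original `toy` frame still has all 4 rows afterwards, because each filter returns a new dataframe.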

### Creating new variables

In [50]:
df['gender'].value_counts()

gender
Male      60892
Female    20493
Name: count, dtype: int64

In [48]:
gender_mapping = {
    # keys are lowercase because the values are lowercased before mapping
    'male': 'Male',
    'm': 'Male',
    'female': 'Female',
    'f': 'Female',
    'feamale': 'Female',
    'female`': 'Female',
    'femae': 'Female',
    'male ': 'Male'
}

In [49]:
df['gender'] = df['gender'].str.lower().map(gender_mapping)
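
Because `.str.lower()` runs before `.map()`, only lowercase keys can ever match, and any value that is not in the mapping becomes missing (`NaN`). A minimal sketch, assuming a made-up series of messy values and a shortened mapping, that also strips stray spaces and lists anything the mapping failed to catch:

```python
import pandas as pd

# Hypothetical messy values, only for illustration
raw = pd.Series(["Male ", "FEMALE", "F", "m", "Feamale", None])

# Shortened, hypothetical mapping with lowercase keys
mapping = {
    "male": "Male",
    "m": "Male",
    "female": "Female",
    "f": "Female",
    "feamale": "Female",
}

# Strip spaces, lowercase, then map; unmapped values become NaN
cleaned = raw.str.strip().str.lower().map(mapping)

# Values that were present in the raw data but not recognised by the mapping
unmatched = raw[cleaned.isna() & raw.notna()]

print(cleaned.tolist())
print(unmatched.tolist())
```

Checking `unmatched` after mapping is a simple way to spot spellings that still need a key.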

We have females and males, but we also need to include youths in our sampling. It's a bad idea to overwrite the values of an original variable, so let's create a new variable and make all our changes on it. Generating a new variable is very simple in pandas: put the dataframe indexed with the name of the new variable on the left of the `=`, and the content of the new variable on the right. The code below generates a new variable, names it `Gender`, and populates it with the values of the household head's gender (`gender`).

In [32]:
df['Gender'] = df['gender']

If you print the first two rows, you'll notice that the `gender` and `Gender` variables have the same contents.

In [33]:
df.head(2)

Unnamed: 0,district,subcounty,Parish,village,hhh_name,age,Contact,gender,id,Gender
0,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Akakwasa More,60.0,785534054,Female,KAG-KAS-AKA-F-075731,Female
1,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Akoragye Peter,30.0,771069272,Male,KAG-KAS-AKO-M-095616,Male


## Functions

In programming, functions are like your go-to tools for tasks you perform often. When it comes to data analysis, I've found that lambda functions are just slicker and quicker for whipping things into shape. Take the code below, for instance.

Here's the breakdown:
- Spot a row with an `age` of 30 or less? Tag it as 'Youth Headed'.
- Otherwise, keep it as 'Male' or 'Female', but tack on 'Headed' to get 'Male Headed' or 'Female Headed'.

In [51]:
df['gender'] = df.apply(
    lambda row: 'Youth Headed' 
    if row['age'] <= 30 
    else str(row['gender']) + ' Headed', axis=1
)
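
One side effect of `str(row['gender'])` is that a missing gender turns into the text `'nan'`, which gives a `'nan Headed'` label. As an alternative sketch (not the approach used in this notebook), the same rule can be written without row-wise `apply` so that missing genders stay missing. The `toy` frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one missing gender
toy = pd.DataFrame({
    "age": [60, 30, 26, np.nan, 45],
    "gender": ["Female", "Male", "Male", "Female", None],
})

# Build 'Male Headed' / 'Female Headed'; na_action='ignore' leaves missing values alone
label = toy["gender"].map(lambda g: f"{g} Headed", na_action="ignore")

# Overwrite the label wherever the age is 30 or less (a missing age never matches)
toy["Household_Head_Gender"] = label.mask(toy["age"] <= 30, "Youth Headed")

print(toy)
```

Whether missing genders should be kept as missing, dropped, or labelled some other way is a decision for the study; this sketch only makes them visible instead of hiding them behind a text label.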

The function below does exactly what the lambda above does, but as you can see it is longer. You might not prefer it, but it's good to understand how named functions work, because sometimes they are the right tool.

In [31]:
# def add_age(df, gender, age):
#     for index, row in df.iterrows():
#         if row[age] <= 30:
#             df.at[index, gender] = 'Youth Headed'
#         else:
#             df.at[index, gender] = str(row[gender]) + ' Headed'

We can apply the function by passing in the dataframe and the names of the two columns. I tried both ways, and the first one (the lambda) was faster than this one.

In [19]:
# add_age(df, 'Household_Head_Gender','Household_Head_Age')
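
If you prefer a named function but want to avoid looping with `iterrows`, one option is a small function that labels a single household and is applied to pairs of values. This is a sketch under assumptions: `label_household` and its `youth_max_age` parameter are hypothetical names, and the toy data is made up.

```python
import pandas as pd

def label_household(age, gender, youth_max_age=30):
    """Return the household label for one household head."""
    if age <= youth_max_age:
        return "Youth Headed"
    return f"{gender} Headed"

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "age": [60, 30, 26],
    "gender": ["Female", "Male", "Male"],
})

# zip walks through the two columns together, one household at a time
toy["Household_Head_Gender"] = [
    label_household(age, gender) for age, gender in zip(toy["age"], toy["gender"])
]

print(toy)
```

Because the function works on plain values rather than on the dataframe, it is also easy to test on its own, for example `label_household(26, "Male")`.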

In [52]:
df.head()

Unnamed: 0,district,subcounty,Parish,village,hhh_name,age,Contact,gender,id
0,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Akakwasa More,60.0,785534054,Female Headed,KAG-KAS-AKA-F-075731
1,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Akoragye Peter,30.0,771069272,Youth Headed,KAG-KAS-AKO-M-095616
2,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Aliganyira Nikorasi,26.0,784875889,Youth Headed,KAG-KAS-ALI-M-094647
3,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Anet Kokundeka,36.0,755236985,Female Headed,KAG-KAS-ANE-F-093637
4,Kagadi,Kicucuuru,Kitooro,Kasozi_Kitooro,Ayinemani Nicorous,21.0,789236760,Youth Headed,KAG-KAS-AYI-M-114920


In [53]:
df['Household_Head_Gender'] = df['gender']

In [54]:
df['Household_Head_Gender'].value_counts()

Household_Head_Gender
Male Headed      42341
Youth Headed     21873
Female Headed    17421
nan Headed        6067
Name: count, dtype: int64

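### Dropping columns

Voilà, you can see that we now have a `Youth Headed` category alongside `Male Headed` and `Female Headed`. The `nan Headed` rows are households whose gender was missing or not matched by the mapping. We can get rid of the `gender` column since we no longer need it. Pandas has a `drop` method for this: you just pass in a list of the columns you want to delete. Setting `inplace=True` modifies `df` directly. With `inplace=False` (the default), `drop` returns a new dataframe and leaves `df` unchanged.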

In [55]:
df.drop(columns=['gender',], inplace=True)
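
The difference between the two styles is easiest to see side by side. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "id": [1, 2],
    "gender": ["Male", "Female"],
    "age": [30, 41],
})

# Default (inplace=False): a new dataframe comes back, toy is untouched
trimmed = toy.drop(columns=["gender"])
print(toy.columns.to_list())       # still contains 'gender'
print(trimmed.columns.to_list())   # 'gender' is gone

# inplace=True: toy itself is changed and nothing is returned
toy.drop(columns=["age"], inplace=True)
print(toy.columns.to_list())       # 'age' is gone from toy
```

Reassigning, as in `toy = toy.drop(columns=[...])`, has the same end result as `inplace=True` and is the style many pandas users prefer.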

We can build a subset for each household type using the filtering trick we've learnt and store them in variables.

In [56]:
male_HH_samp_frame = df[df['Household_Head_Gender']=='Male Headed']
female_HH_samp_frame = df[df['Household_Head_Gender']=='Female Headed']
youth_HH_samp_frame = df[df['Household_Head_Gender']=='Youth Headed']

In [57]:
male_HH_samp_frame.shape, female_HH_samp_frame.shape, youth_HH_samp_frame.shape

((42341, 9), (17421, 9), (21873, 9))

In [58]:
# male_HH_samp_frame.to_excel('mit/male_HH_samp_frame.xlsx', index=False)
# female_HH_samp_frame.to_excel('mit/female_HH_samp_frame.xlsx', index=False)
# youth_HH_samp_frame.to_excel('mit/youth_HH_samp_frame.xlsx', index=False)

If you want to store these subsets as Excel files, you can use the `to_excel` method and pass in the path, as in the commented-out lines above.

To take it slow, we are going to sample males, females and youths separately.

Using our distribution:
- For villages with 100 or more households (n >= 100) we need 30 samples, split 60/20/20 (Male/Female/Youth):
  - 60% of 30 = 18 for the males
  - 20% of 30 = 6 for the females
  - 20% of 30 = 6 for the youths
- For villages with fewer than 100 households (n < 100) we need 24 samples, split 50/25/25 (Male/Female/Youth):
  - 50% of 24 = 12 for the males
  - 25% of 24 = 6 for the females
  - 25% of 24 = 6 for the youths
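
The quota arithmetic above (60/20/20 of 30 samples for large villages, 50/25/25 of 24 samples for small ones) can be sketched as a small helper. The function name `group_quotas` and its `threshold` parameter are hypothetical, chosen only for this illustration:

```python
def group_quotas(village_size, threshold=100):
    """Return the number of households to sample per group for one village."""
    if village_size >= threshold:
        total = 30
        shares = {"Male Headed": 0.60, "Female Headed": 0.20, "Youth Headed": 0.20}
    else:
        total = 24
        shares = {"Male Headed": 0.50, "Female Headed": 0.25, "Youth Headed": 0.25}
    # round() guards against tiny floating-point errors before converting to whole numbers
    return {group: round(total * share) for group, share in shares.items()}

print(group_quotas(150))   # a large village
print(group_quotas(40))    # a small village
```

Both cases give 6 females and 6 youths; only the male quota changes between 18 and 12.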

## Male Headed

In [59]:
len(male_HH_samp_frame)

42341

In [60]:
# 60% of 30 and 50% of 24
print(f"60% of 30 is {0.6*30} whereas 50% of 24 is {0.5*24}")

60% of 30 is 18.0 whereas 50% of 24 is 12.0


In [61]:
sample_size = int(0.6*30)
sample_size

18

In [62]:
threshold_count = 100

In [63]:
village_counts = df['village'].value_counts()
village_counts

village
Ikumbya       637
Ntayigirwa    513
Wandago       491
Bujogoro_A    452
Nsambya       433
             ... 
Buwumba         8
Kafuro_3        1
Kafuro_4        1
Kafuro_6        1
Kafuro_5        1
Name: count, Length: 788, dtype: int64

In [67]:
village_counts[village_counts < 30]

village
Kashenshero_A        27
Mugurante_Kiyembe    27
Kibaale_North        25
Nyamatongo_1         23
Kamakora             22
Buwumba               8
Kafuro_3              1
Kafuro_4              1
Kafuro_6              1
Kafuro_5              1
Name: count, dtype: int64

In [37]:
df['village'].iloc[2]

'Kihunga_Nyakishowa'

In [41]:
village_counts.get('Kiogoma_1',0)

850

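Here is what this function does: it receives the rows of one village, reads the village name from the first row, and checks whether that village's count in `village_counts` is greater than or equal to the threshold of 100. It then samples 18 households for a large village and 12 for a small one, or every row if the group has fewer rows than that.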

In [38]:
def custom_sample(x):
    village_name = x['village'].iloc[0]
    if village_counts.get(village_name, 0) >= 100:
        return x.sample(n=min(len(x), 18))
    else:
        return x.sample(n=min(len(x), 12))
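
A random sample changes every time the cell is run unless a seed is fixed. The sketch below is an alternative to `custom_sample`, not the method used in this notebook: it shuffles the rows once with a `random_state`, numbers the rows within each village, and keeps those below the village's quota. The function name `sample_per_village`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def sample_per_village(frame, village_counts, threshold=100,
                       n_large=18, n_small=12, seed=42):
    """Pick up to n_large or n_small rows per village, reproducibly."""
    # Shuffle all rows once; the seed makes the shuffle repeatable
    shuffled = frame.sample(frac=1, random_state=seed)
    # Look up each row's village size and turn it into a quota
    sizes = shuffled["village"].map(village_counts).fillna(0)
    quota = np.where(sizes >= threshold, n_large, n_small)
    # Number the rows within each village: 0, 1, 2, ... in shuffled order
    position = shuffled.groupby("village").cumcount()
    return shuffled[position < quota]

# Hypothetical toy data: one large village and one small village
toy = pd.DataFrame({"village": ["Big"] * 150 + ["Small"] * 5})
toy["hh"] = range(len(toy))
village_counts = toy["village"].value_counts()

picked = sample_per_village(toy, village_counts)
print(picked["village"].value_counts())
```

Running it again with the same `seed` returns the same households, which helps when a sample has to be reproduced or audited later.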

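`apply` works much like it did with our lambda earlier: here, after grouping by village, it calls `custom_sample` once for each village's rows and puts the results back together.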

In [41]:
male_target_sample = male_HH_samp_frame.groupby('village', group_keys=False).apply(custom_sample)

In [42]:
len(male_target_sample)

6787

We can generate a new column, `status`, and assign it the value `target`.

In [43]:
male_target_sample['status'] = 'target'

In [44]:
male_target_sample.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
1740,Buhweju,Kyahenda,Nyamihira,Akatojo,Jackson Musinguzi,0788-81-29-04,Male Headed,Buh-Aka-Jac-M-180309-14,target
3363,Buhweju,Kyahenda,Nyamihira,Akatojo,Pauson Katunguka,0772-98-63-99,Male Headed,Buh-Aka-Pau-M-100640-14,target
1737,Buhweju,Kyahenda,Nyamihira,Akatojo,Wilson Tuhame,0788-65-73-74,Male Headed,Buh-Aka-Wil-M-173148-14,target
3369,Buhweju,Kyahenda,Nyamihira,Akatojo,John Bundaga,0788-65-73-74,Male Headed,Buh-Aka-Joh-M-110920-14,target
3359,Buhweju,Kyahenda,Nyamihira,Akatojo,Isamen Nowamaani,0760-96-99-69,Male Headed,Buh-Aka-Isa-M-090850-14,target


Let's check the number of sampled households in Kako village: it's 18.

In [46]:
len(male_target_sample[male_target_sample['village'] == 'Kako'])

18

In [47]:
# male_target_sample.to_excel('mit/male_target_sample.xlsx', index=False)

But we need some reserves, so let's repeat the process.

## Reserve frame

We can drop the targets from the sampling frame using their index, since they've already been assigned.

In [48]:
male_reserve_s_frame = male_HH_samp_frame.drop(index = male_target_sample.index)
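
Dropping by index only works as intended when every row has its own index label. A small sketch on hypothetical data (the `frame`, the seeds and the sample size of 2 are made up) shows the idea and checks that targets and reserves never overlap:

```python
import pandas as pd

# Hypothetical sampling frame: 6 households in village A, 3 in village B
frame = pd.DataFrame({"village": ["A"] * 6 + ["B"] * 3, "hh": range(9)})
assert frame.index.is_unique   # each row needs its own index label

# Targets: shuffle, then keep the first 2 rows of each village
target = frame.sample(frac=1, random_state=1).groupby("village").head(2)

# Reserve frame: everything that was not picked as a target
reserve_frame = frame.drop(index=target.index)

# Reserves: same procedure on what is left
reserve = reserve_frame.sample(frac=1, random_state=2).groupby("village").head(2)

overlap = target.index.intersection(reserve.index)
print(len(target), len(reserve), len(overlap))
```

Village B has only one household left after its two targets are removed, so it contributes a single reserve.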

In [49]:
# male_reserve_s_frame.to_excel('mit/male_reserve_s_frame.xlsx', index=False)

In [50]:
male_reserve_sample = male_reserve_s_frame.groupby('village', group_keys=False).apply(custom_sample)

In [51]:
male_reserve_sample['status'] = 'reserve'
male_reserve_sample.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
1726,Buhweju,Kyahenda,Nyamihira,Akatojo,Yowasi Nyangi,0788-65-73-74,Male Headed,Buh-Aka-Yow-M-151028-14,reserve
1719,Buhweju,Kyahenda,Nyamihira,Akatojo,Robert Manzi,0783-62-75-05,Male Headed,Buh-Aka-Rob-M-134714-14,reserve
1721,Buhweju,Kyahenda,Nyamihira,Akatojo,Gard Sabit,0788-65-73-74,Male Headed,Buh-Aka-Gar-M-140928-14,reserve
3378,Buhweju,Kyahenda,Nyamihira,Akatojo,Justus Banturaki,0778-76-35-38,Male Headed,Buh-Aka-Jus-M-123623-14,reserve
1728,Buhweju,Kyahenda,Nyamihira,Akatojo,Fuderi Tumusime,0772-47-91-23,Male Headed,Buh-Aka-Fud-M-153058-14,reserve


In [52]:
len(male_reserve_sample[male_reserve_sample['village'] == 'Kako'])

18

In [53]:
# male_reserve_sample.to_excel('mit/male_reserve_sample.xlsx', index=False)

Let's combine the targets and reserves to come up with a complete sample for males

# Male Household Sample [Combining the Target and Reserve]

In [54]:
male_target_sample.shape, male_reserve_sample.shape

((6787, 9), (6614, 9))

In [55]:
MALE_HH_SAMPLE = pd.concat([male_target_sample, male_reserve_sample])

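If your dataframes share a key column (a primary key), you can instead join them side by side using `merge`, whereas `concat` simply stacks the rows. The pandas documentation covers `merge` in detail.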
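
To make the contrast concrete, here is a minimal sketch with two hypothetical tables that share an `id` column (both tables and the `visited` column are made up for illustration):

```python
import pandas as pd

# Hypothetical tables sharing the key column 'id'
households = pd.DataFrame({"id": ["H1", "H2", "H3"], "village": ["A", "A", "B"]})
visits = pd.DataFrame({"id": ["H1", "H3"], "visited": [True, True]})

# merge: match rows on the key and add columns side by side
# how='left' keeps every household, even those without a visit record
merged = households.merge(visits, on="id", how="left")

# concat: stack rows that already have the same columns
stacked = pd.concat([households.iloc[:2], households.iloc[2:]])

print(merged)
print(stacked)
```

In `merged`, household `H2` has no matching visit, so its `visited` value is missing.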

In [56]:
# MALE_HH_SAMPLE.to_excel('mit/MALE_HH_SAMPLE.xlsx', index=False)

We are done, let's simply do the same for females and youths

# FEMALE Headed

In [57]:
len(female_HH_samp_frame)

12448

In [58]:
female_HH_samp_frame.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid
0,Mitooma,Mitooma,Nyakishojwa,Kashasha,Kyomuhangi Jadress,0763-25-28-10,Female Headed,Mit-Kas-Kyo-F-082932-5
1,Mitooma,Mitooma,Nyakishojwa,Karoza_A,Nuwagaba Charity,0778-56-33-66,Female Headed,Mit-Kar-Nuw-F-082625-5
2,Mitooma,Mitooma,Nyakishojwa,Kihunga_Nyakishowa,Twogirwe Mollias,0773-35-56-76,Female Headed,Mit-Kih-Two-F-083135-5
4,Mitooma,Mitooma,Nyakishojwa,Kibisho_A,Mauda Ndebwomwe,0706-94-93-06,Female Headed,Mit-Kib-Mau-F-083009-5
6,Mitooma,Mitooma,Nyakishojwa,Karoza_A,komwani Mary,0775-65-31-17,Female Headed,Mit-Kar-kom-F-083805-5


Notice that I am not using the custom function here, because 20% of 30 and 25% of 24 both give 6 samples. We can just sample 6 per village directly.

In [59]:
female_target_sample = female_HH_samp_frame.groupby('village', group_keys=False).apply(lambda x: x.sample(n=min(len(x), 6)))

In [60]:
len(female_target_sample)

2702

In [61]:
female_target_sample['status'] = 'target'

In [62]:
female_target_sample.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
3367,Buhweju,Kyahenda,Nyamihira,Akatojo,Hairet Natukunda,0788-65-73-74,Female Headed,Buh-Aka-Hai-F-105346-14,target
3383,Buhweju,Kyahenda,Nyamihira,Akatojo,Kyomuhangi Jane,0788-65-72-74,Female Headed,Buh-Aka-Kyo-F-135222-14,target
1735,Buhweju,Kyahenda,Nyamihira,Akatojo,Adrine Tumuhirye,0784-80-38-86,Female Headed,Buh-Aka-Adr-F-170549-14,target
1713,Buhweju,Kyahenda,Nyamihira,Akatojo,Anitah Natukwasa,0789-53-44-09,Female Headed,Buh-Aka-Ani-F-124217-14,target
3381,Buhweju,Kyahenda,Nyamihira,Akatojo,Asiya Karunga,0788-65-73-74,Female Headed,Buh-Aka-Asi-F-132140-14,target


In [63]:
# female_target_sample.to_excel('mit/female_target_sample.xlsx', index=False)

## Reserve frame

In [64]:
female_reserve_s_frame = female_HH_samp_frame.drop(index = female_target_sample.index)
female_reserve_s_frame.shape

(9746, 8)

In [65]:
# female_reserve_s_frame.to_excel('mit/female_reserve_s_frame.xlsx', index=False)

In [66]:
female_reserve_sample = female_reserve_s_frame.groupby('village', group_keys=False).apply(lambda x: x.sample(n=min(len(x), 6)))

In [67]:
female_reserve_sample['status'] = 'reserve'
female_reserve_sample.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
1725,Buhweju,Kyahenda,Nyamihira,Akatojo,Dovina Tukamuhabwa,0788-65-73-74,Female Headed,Buh-Aka-Dov-F-150345-14,reserve
1739,Buhweju,Kyahenda,Nyamihira,Akatojo,Provia Tumusiime,0788-65-73-74,Female Headed,Buh-Aka-Pro-F-175529-14,reserve
1717,Buhweju,Kyahenda,Nyamihira,Akatojo,Lidya Tumworebere,0788-65-73-74,Female Headed,Buh-Aka-Lid-F-132006-14,reserve
1732,Buhweju,Kyahenda,Nyamihira,Akatojo,Mary Turyamuhebwa,0779-44-60-29,Female Headed,Buh-Aka-Mar-F-163632-14,reserve
3364,Buhweju,Kyahenda,Nyamihira,Akatojo,Lillian Kibatenga,0788-65-73-74,Female Headed,Buh-Aka-Lil-F-101914-14,reserve


In [68]:
# female_reserve_sample.to_excel('mit/female_reserve_sample.xlsx', index=False)

In [69]:
FEMALE_HH_SAMPLE = pd.concat([female_target_sample, female_reserve_sample])

In [70]:
# FEMALE_HH_SAMPLE.to_excel('mit/FEMALE_HH_SAMPLE.xlsx', index=False)

# YOUTH Households

In [71]:
len(youth_HH_samp_frame)

14634

In [72]:
youth_HH_samp_frame.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid
10,Mitooma,Mitooma,Nyakishojwa,Kashasha,Ainomujuni Mavini,0706-55-49-45,Youth Headed,Mit-Kas-Ain-M-085223-5
20,Mitooma,Mitooma,Nyakishojwa,Kashasha,Twesigye Gilbert,0787-12-15-05,Youth Headed,Mit-Kas-Twe-M-091417-5
25,Mitooma,Mitooma,Nyakishojwa,Kihunga_Nyakishowa,Tumwine Asaph,0775-14-86-49,Youth Headed,Mit-Kih-Tum-M-091142-5
40,Mitooma,Mitooma,Nyakishojwa,Nyakishojwa_Central,Amutuhaire Sikora,0773-15-11-61,Youth Headed,Mit-Nya-Amu-F-095910-5
47,Buhweju,Kyahenda,Kiyanja,Kakonkoma,Ronald Keinerugaba,0780-10-71-85,Youth Headed,Buh-Kak-Ron-M-083202-14


In [73]:
youth_target_sample = youth_HH_samp_frame.groupby('village', group_keys=False).apply(lambda x: x.sample(n=min(len(x), 6)))

In [74]:
len(youth_target_sample)

2643

In [75]:
youth_target_sample['status'] = 'target'

In [76]:
youth_target_sample.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
3375,Buhweju,Kyahenda,Nyamihira,Akatojo,Adson Oweyesigire,0778-57-44-93,Youth Headed,Buh-Aka-Ads-M-120945-14,target
4082,Buhweju,Kyahenda,Nyamihira,Akatojo,Siliverio Ashabahebwa,0787-37-92-51,Youth Headed,Buh-Aka-Sil-M-142549-14,target
3371,Buhweju,Kyahenda,Nyamihira,Akatojo,Babra Barbra,0782-56-98-77,Youth Headed,Buh-Aka-Bab-F-113109-14,target
276,Buhweju,Kyahenda,Nyamihira,Akatojo,Ivan Nuwagira,0704-56-42-99,Youth Headed,Buh-Aka-Iva-M-104508-14,target
3368,Buhweju,Kyahenda,Nyamihira,Akatojo,Boaz Arinitwe,0778-25-86-41,Youth Headed,Buh-Aka-Boa-M-105949-14,target


In [77]:
# youth_target_sample.to_excel('mit/youth_target_sample.xlsx', index=False)

## Reserve frame

In [78]:
youth_reserve_s_frame = youth_HH_samp_frame.drop(index = youth_target_sample.index)
youth_reserve_s_frame.shape

(11991, 8)

In [79]:
# youth_reserve_s_frame.to_excel('mit/youth_reserve_s_frame.xlsx', index=False)

In [80]:
youth_reserve_sample = youth_reserve_s_frame.groupby('village', group_keys=False).apply(lambda x: x.sample(n=min(len(x), 6)))

In [81]:
youth_reserve_sample['status'] = 'reserve'
youth_reserve_sample.head()

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
3361,Buhweju,Kyahenda,Nyamihira,Akatojo,Ambrose Tumusingize,0781-78-11-27,Youth Headed,Buh-Aka-Amb-M-095140-14,reserve
1736,Buhweju,Kyahenda,Nyamihira,Akatojo,Ronard Mbabazi,0783-61-62-11,Youth Headed,Buh-Aka-Ron-M-172233-14,reserve
273,Buhweju,Kyahenda,Nyamihira,Akatojo,Asani Nuwamanya,0753-46-83-52,Youth Headed,Buh-Aka-Asa-M-101415-14,reserve
3372,Buhweju,Kyahenda,Nyamihira,Akatojo,Ronard Muhweju,0763-93-28-08,Youth Headed,Buh-Aka-Ron-M-114224-14,reserve
274,Buhweju,Kyahenda,Nyamihira,Akatojo,Bashiru Arinitwe,0789-42-66-54,Youth Headed,Buh-Aka-Bas-M-102520-14,reserve


In [82]:
# youth_reserve_sample.to_excel('mit/youth_reserve_sample.xlsx', index=False)

# Youth Household Sample

In [83]:
youth_target_sample.shape, youth_reserve_sample.shape

((2643, 9), (2263, 9))

In [84]:
YOUTH_HH_SAMPLE = pd.concat([youth_target_sample, youth_reserve_sample])

In [85]:
# YOUTH_HH_SAMPLE.to_excel('mit/YOUTH_HH_SAMPLE.xlsx', index=False)

We now have samples for males, females, and youths. Let's combine them.

# JOINING THE DATAFRAMES

In [88]:
MALE_HH_SAMPLE.shape, FEMALE_HH_SAMPLE.shape, YOUTH_HH_SAMPLE.shape

((13401, 9), (5212, 9), (4906, 9))

In [5]:
import numpy as np

In [91]:
np.array((MALE_HH_SAMPLE.shape[0], FEMALE_HH_SAMPLE.shape[0], YOUTH_HH_SAMPLE.shape[0])).sum(), len(df)

(23519, 56187)

In [92]:
FINAL = pd.concat([MALE_HH_SAMPLE, FEMALE_HH_SAMPLE, YOUTH_HH_SAMPLE])

In [94]:
FINAL.to_excel('FINAL_Track_sheets.xlsx', index=False)

In [95]:
FINAL

Unnamed: 0,district,subcounty,parish_t,village,hhh_full_name,Household_Head_Contact,Household_Head_Gender,hhid,status
1740,Buhweju,Kyahenda,Nyamihira,Akatojo,Jackson Musinguzi,0788-81-29-04,Male Headed,Buh-Aka-Jac-M-180309-14,target
3363,Buhweju,Kyahenda,Nyamihira,Akatojo,Pauson Katunguka,0772-98-63-99,Male Headed,Buh-Aka-Pau-M-100640-14,target
1737,Buhweju,Kyahenda,Nyamihira,Akatojo,Wilson Tuhame,0788-65-73-74,Male Headed,Buh-Aka-Wil-M-173148-14,target
3369,Buhweju,Kyahenda,Nyamihira,Akatojo,John Bundaga,0788-65-73-74,Male Headed,Buh-Aka-Joh-M-110920-14,target
3359,Buhweju,Kyahenda,Nyamihira,Akatojo,Isamen Nowamaani,0760-96-99-69,Male Headed,Buh-Aka-Isa-M-090850-14,target
...,...,...,...,...,...,...,...,...,...
28042,Kagadi,Kanyabeebe,Kashagali,Yeruzalemu_B,Keresi Niwamanya,0771-22-54-91,Youth Headed,Kag-Yer-Ker-M-140235-9,reserve
27967,Kagadi,Kanyabeebe,Kashagali,Yeruzalemu_C,Kukahirwa Busobozi,0763-81-93-13,Youth Headed,Kag-Yer-Kuk-M-135234-9,reserve
27957,Kagadi,Kanyabeebe,Kashagali,Yeruzalemu_C,Katurebe Asafu,0785-57-26-34,Youth Headed,Kag-Yer-Kat-M-121226-9,reserve
27962,Kagadi,Kanyabeebe,Kashagali,Yeruzalemu_C,Maniraguha Mozesi,0785-57-26-34,Youth Headed,Kag-Yer-Man-M-131059-9,reserve


In [102]:
len(FINAL[FINAL['village'] == 'Yeruzalemu_C'])

34

There is a term used in software engineering, DRY, meaning Don't Repeat Yourself. A lot of the code above repeats itself. Even though the code works and does what it is intended to do, an expert reading it could easily grade you as a beginner.
- Let's see in the next series how we can write high-quality code. If you're curious, you can try it yourself.
- You can also turn this into a script rather than a Jupyter notebook. Can you do that? Give it a try. Notebooks are great for data analysis, exploratory analysis, and experimenting with things. But if your goal is to become a serious programmer, they might not take you that far. It's so easy to feel lazy to
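
As a taste of what DRY could look like here, the target and reserve steps can be wrapped in one helper and reused for every group. This is only a sketch under assumptions, not the code of the next series: the name `draw_target_and_reserve` is hypothetical, the quotas of 2 are shrunk to fit the toy data, and instead of sampling twice it shuffles once and takes the first `n` rows per village as targets and the next `n` as reserves.

```python
import pandas as pd

def draw_target_and_reserve(frame, n, seed=0):
    """Return up to n targets and up to n reserves per village, labelled in 'status'."""
    shuffled = frame.sample(frac=1, random_state=seed)
    # Position of each row within its village after shuffling: 0, 1, 2, ...
    position = shuffled.groupby("village").cumcount()
    target = shuffled[position < n].assign(status="target")
    reserve = shuffled[(position >= n) & (position < 2 * n)].assign(status="reserve")
    return pd.concat([target, reserve])

# Hypothetical toy frame with two household types
toy = pd.DataFrame({
    "village": ["A"] * 14 + ["B"] * 3,
    "Household_Head_Gender": (
        ["Female Headed"] * 10 + ["Youth Headed"] * 4 + ["Female Headed"] * 3
    ),
})

# Hypothetical quotas, kept small so the toy data is enough
quotas = {"Female Headed": 2, "Youth Headed": 2}

parts = [
    draw_target_and_reserve(toy[toy["Household_Head_Gender"] == group], n)
    for group, n in quotas.items()
]
final = pd.concat(parts)
print(final.groupby(["Household_Head_Gender", "status"]).size())
```

One function and one loop replace the three near-identical blocks, so a change to the sampling rule only has to be made in one place.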

In [100]:
# grouped_data = FINAL.groupby('status')

# with pd.ExcelWriter('mit/01_Mitooma.xlsx', engine='openpyxl') as writer:
#     for status, group_df in grouped_data:
#         group_df.to_excel(writer, sheet_name=f'{status}', index=False)
