# EA x DKSG
**classification:**
classify organizations into cause areas based on their descriptions
in short: descriptions -> cause areas

## methodology
- clean up organization descriptions
    - drop stopwords
    - drop punctuation
    - lemmatize words
- clean up ea keywords
- create count vectors for each description
- use count vectors as features for classification with
    - linear regressor
    - decision trees
    - random forest regressor
    - deep LSTM
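
The cleaning and count-vector steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's `get_cleaned_descriptions` helper (that lives in `helpers.py` and is not shown here). The stopword set `STOPWORDS`, the suffix-stripping `naive_lemmatize`, and all function names are hypothetical stand-ins; a real pipeline would likely use a proper lemmatizer and vectorizer from an NLP library.

```python
import string
from collections import Counter

# Hypothetical, tiny stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "we"}

def naive_lemmatize(word):
    """Crude stand-in for a lemmatizer: strip a plural 's' only."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean_description(text):
    """Lowercase, drop punctuation and stopwords, then lemmatize."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_lemmatize(w) for w in no_punct.split() if w not in STOPWORDS]

def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs in one description."""
    counts = Counter(tokens)
    # Dict iteration follows insertion order, which matches the column index.
    return [counts.get(tok, 0) for tok in vocabulary]

# Made-up descriptions, for illustration only.
descriptions = [
    "We protect the forests and the rivers.",
    "Schools, schools and clinics.",
]
tokens = [clean_description(d) for d in descriptions]
vocab = build_vocabulary(tokens)
vectors = [count_vector(t, vocab) for t in tokens]
```

Each row of `vectors` has one column per vocabulary word, so the rows can be stacked into a feature matrix for any of the models listed above.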
    
## hypothesis
an organization whose description contains the keywords defined for a cause area is likely to be involved in that cause area
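
One way to make this hypothesis concrete is a keyword-overlap score: count how many of a cause area's keywords occur among the description's cleaned tokens. The sketch below assumes both sides are already cleaned token lists (like `desc_clean_words` and `keywords_clean_words`). The function names, the `min_overlap` threshold, and the sample data are illustrative assumptions, not taken from the notebook.

```python
def keyword_overlap(description_tokens, keyword_tokens):
    """Number of distinct cause-area keywords found in the description."""
    return len(set(description_tokens) & set(keyword_tokens))

def best_matching_causes(description_tokens, cause_keywords, min_overlap=1):
    """Rank cause areas by overlap, keeping those with at least min_overlap hits."""
    scores = {
        cause: keyword_overlap(description_tokens, keywords)
        for cause, keywords in cause_keywords.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(cause, score) for cause, score in ranked if score >= min_overlap]

# Illustrative data only.
cause_keywords = {
    "environment": ["recycle", "plastic", "pollution", "nature"],
    "disaster relief": ["flood", "earthquake", "cyclone", "relief"],
}
description = ["community", "recycle", "plastic", "waste", "flood"]
matches = best_matching_causes(description, cause_keywords)
```

A rule like this can serve as a baseline to compare the trained classifiers against.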

## clean up organization descriptions

In [29]:
## setup
%run env_setup.py
%run filepaths.py
%run helpers.py

In [22]:
web_df = read_from_csv(WEB_SCRAPE_CSV)

In [33]:
web_df.head()

Unnamed: 0,name,description,website,cause_area,programme_types,address,country,city,contact_number,email,...,IPC.Status,IPC.Period,Details.URL,fax,revenue,employees,lat,lon,desc_clean_words,desc_clean
0,Bali Pink Ribbon - Breast Cancer Awareness Fou...,"Bali Pink Ribbon was founded by Gaye Warren, a...",http://www.balipinkribbon.com,"Balinese People, Health",Our vision is to prevent Indonesian women from...,80113 Dauh Puri Kauh (Denpasar Barat),indonesia,Denpasar,+62 361 4746238,pr@balipinkribbon.com,...,,,,,,,-8.673612,115.203737,"[bali, pink, ribbon, found, gaye, warren, brit...",bali pink ribbon found gaye warren british bre...
1,Volunteer Programs Bali,"At VP Bali, we believe that education can chan...",http://http://volunteerprogramsbali.org,"Balinese People, Children, Balinese Art & Culture",A dopting the Balinese values and its premise ...,80571 Ubud (Petulu),indonesia,Ubud,,info@volunteerprogramsbali.org,...,,,,,,,-8.67509,115.189919,"[at, vp, bali, believe, education, change, chi...",at vp bali believe education change childã¢ââ...
2,NGO 4 Ger,Only for specials,http://www.fiedbeck.de,Balinese People,,Strasse 1 Bali,indonesia,Amed,+49 7531 123456,h.vergara@fiedbeck.de,...,,,,,,,-8.409518,115.188916,"[only, special]",only special
3,Bali Children Foundation,"Our vision, at Bali Children Foundation, is to...",http://www.balichildrenfoundation.org,"Children, Education & Schools",To provide community education in remote areas...,"Jl. Raya Kesambi No.369, Kerobokan, Kuta Utara...",indonesia,Seminyak,+62 361 847 5399,info@balichildrenfoundation.org,...,,,,,,,-8.653567,115.172545,"[our, vision, bali, children, foundation, prov...",our vision bali children foundation provide ed...
4,Friends of the National Parks Foundation,Friends of the National Park Foundation (FNPF)...,http://www.fnpf.org,Environment & Nature Conservation,"To protect wildlife and its habitat, at the sa...","Ped, Nusapenida, Klungkung Regency, Bali 80771...",indonesia,Pejeng,+62 361 4792286,info@fnpf.org,...,,,,,,,-8.68388,115.518827,"[friends, national, park, foundation, fnpf, in...",friends national park foundation fnpf indonesi...


In [25]:
## fill up empty descriptions
web_df['description'] = web_df['description'].fillna('')

## add cleaned description
web_df['desc_clean_words'] = get_cleaned_descriptions(list(web_df['description']), True, True, True)
web_df['desc_clean'] = get_sentence_from_list(list(web_df['desc_clean_words']))

In [26]:
web_df['desc_clean'].head()

0    bali pink ribbon found gaye warren british bre...
1    at vp bali believe education change childã¢ââ...
2                                         only special
3    our vision bali children foundation provide ed...
4    friends national park foundation fnpf indonesi...
Name: desc_clean, dtype: object

In [35]:
web_df['desc_clean_words'].head()

0    [bali, pink, ribbon, found, gaye, warren, brit...
1    [at, vp, bali, believe, education, change, chi...
2                                      [only, special]
3    [our, vision, bali, children, foundation, prov...
4    [friends, national, park, foundation, fnpf, in...
Name: desc_clean_words, dtype: object

In [57]:
## add cleaned cause areas
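def clean_cause_area(cause_area_raw):
    """Split a comma-separated cause-area string into a set of lowercase labels."""
    words = cause_area_raw.lower().split(",")
    words = [w.strip() for w in words]
    words = [w for w in words if len(w) > 0]  # drop empty entries
    return set(words)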

web_df['cause_area'] = web_df['cause_area'].fillna('')
web_df['cause_area_clean'] = web_df['cause_area'].apply(clean_cause_area)

In [58]:
web_df['cause_area_clean'].head()

0                            {health, balinese people}
1    {children, balinese art & culture, balinese pe...
2                                    {balinese people}
3                      {children, education & schools}
4                  {environment & nature conservation}
Name: cause_area_clean, dtype: object
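
To see how `clean_cause_area` behaves at the edges, the sketch below repeats the function in condensed form and applies it to the raw value from row 1, plus two made-up inputs: an empty string (as produced by `fillna('')`) and a string with stray commas and whitespace.

```python
def clean_cause_area(cause_area_raw):
    """Split a comma-separated cause-area string into a set of lowercase labels."""
    words = [w.strip() for w in cause_area_raw.lower().split(",")]
    return {w for w in words if w}

row_1 = clean_cause_area("Balinese People, Children, Balinese Art & Culture")
empty = clean_cause_area("")               # missing cause area
messy = clean_cause_area(" Health, ,health,  ")  # made-up messy input
```

Organizations without a cause area end up with an empty set, so they carry no label into the later steps.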

## clean up ea keywords

In [30]:
ea_df = read_from_csv(EA_CSV)

In [34]:
ea_df.head()

Unnamed: 0,Causes/ Columns,Keywords_Set 1,Keywords_Set 2,Yad's comments
0,Health infectious diseases,"HIV, AIDs, Tuberculosis, Clinic, Hepatitis, De...","HIV, AIDs, Tuberculosis, Hepatitis, Dengue, Ma...",
1,Neglected tropical diseases (NTDs),"Deworming, parasitic worms, neglected tropical...","Deworming, parasitic worms, neglected tropical...",
2,Social Enterprise,"Social Entrepreneur, business, Entrepreneurshi...","Social Entrepreneur, Entrepreneurship",
3,Environment,"Recycle, Water, plastic, nature, fishery, farm...","Recycle, plastic, pollution, natural resources...",
4,Disaster relief,"Flood, natural disaster, cyclones, earthquakes...","Flood, natural disaster, cyclones, earthquakes...",


In [52]:
ea_df['keywords_clean_words'] = get_cleaned_descriptions(list(ea_df[KEYWORDS_COLUMN]), True, True, True)
ea_df['keywords_clean'] = get_sentence_from_list(list(ea_df['keywords_clean_words']))

In [54]:
ea_df['keywords_clean_words'].head()

0    [hiv, aids, tuberculosis, clinic, hepatitis, d...
1    [deworming, parasitic, worm, neglect, tropical...
2    [social, entrepreneur, business, entrepreneurs...
3    [recycle, water, plastic, nature, fishery, far...
4    [flood, natural, disaster, cyclone, earthquake...
Name: keywords_clean_words, dtype: object

In [56]:
ea_df['keywords_clean'].head()

0    hiv aids tuberculosis clinic hepatitis dengue ...
1    deworming parasitic worm neglect tropical dise...
2    social entrepreneur business entrepreneurship ...
3    recycle water plastic nature fishery farm poll...
4    flood natural disaster cyclone earthquake reli...
Name: keywords_clean, dtype: object

## basic viz

In [91]:
from itertools import chain
## flatten the per-organization sets into one list of cause areas
web_cause_area = list(chain.from_iterable(web_df['cause_area_clean']))
## count occurrences of each cause area, most frequent first
web_cause_area_df = pd.Series(web_cause_area).value_counts().reset_index()
web_cause_area_df.columns = ['cause area', 'count']
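
The same flatten-and-count step can be sketched with the standard library alone, which makes the logic easy to check on a toy input. The sample sets below are made up and stand in for `web_df['cause_area_clean']`; only the technique mirrors the cell above.

```python
from collections import Counter
from itertools import chain

# Made-up stand-in for web_df['cause_area_clean'].
cause_area_sets = [
    {"health", "children"},
    {"children", "education"},
    set(),            # organization with no cause area
    {"children"},
]

# Flatten all sets into one stream of labels and count each label.
counts = Counter(chain.from_iterable(cause_area_sets))
top = counts.most_common(1)       # most frequent cause area
unique_cause_areas = len(counts)  # number of distinct cause areas
```

Because each organization contributes a set, a cause area is counted at most once per organization.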

Q: top 20 cause areas with highest counts

In [92]:
web_cause_area_df[:20]

Unnamed: 0,cause area,count
0,religious,1085
1,education,463
2,social and welfare,392
3,health,281
4,children,251
5,charitable,242
6,others,232
7,environment,225
8,support groups,214
9,arts and heritage,147


Q: total number of unique cause areas

In [87]:
len(web_cause_area_df)

586

## create count vectors for each description