# EA x DKSG
**classification: ea**

classify cause area keywords -> cause areas
- expert knowledge & elbow grease from EA team

## methodology
- clean up cause keywords
    - drop stopwords
    - drop punctuation
- create count vectors for each cause area keyword set
- use count vectors as features for classification with
    - decision tree
    
## hypothesis
given a cause area's list of keywords, I should be able to identify that cause area uniquely.

## goal
1. create a model that can distinguish each cause area based on cause area keywords
2. capture related concepts based on existing keywords in each cause area
3. do 1 and 2 with as simple a model as possible (for interpretability)

In [1]:
## setup
%run env_setup.py
%run filepaths.py
%run helpers.py

## clean up cause area keywords

In [2]:
## ml setup
import sklearn
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn import tree
from sklearn import ensemble
from sklearn.metrics import accuracy_score

In [3]:
ea_df = read_from_csv(EA_CSV)

In [4]:
ea_df['keywords_clean_words'] = get_cleaned_descriptions(list(ea_df[KEYWORDS_COLUMN]), True, True, False)
ea_df['keywords_clean'] = get_sentence_from_list(list(ea_df['keywords_clean_words']))
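
`helpers.py` is not shown here, so the exact behaviour of `get_cleaned_descriptions` and `get_sentence_from_list` is unknown. The following is only a minimal sketch of the two cleaning steps from the methodology (drop punctuation, drop stopwords). The function name `clean_keywords` and the tiny `STOPWORDS` set are hypothetical stand-ins, not the notebook's actual helpers.

```python
import string

# hypothetical, deliberately tiny stopword list (assumption, for illustration only)
STOPWORDS = {"and", "or", "the", "of", "for", "in", "a", "an", "to"}

def clean_keywords(raw):
    """Lowercase a keyword string, strip punctuation, and drop stopwords."""
    # map every punctuation character to a space so "a,b" splits into two words
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = raw.lower().translate(table).split()
    return [w for w in words if w not in STOPWORDS]

cleaned_words = clean_keywords("HIV, AIDs, Tuberculosis, Clinic")
cleaned_sentence = " ".join(cleaned_words)  # one string per cause area
print(cleaned_words)
print(cleaned_sentence)
```

The word list plays the role of `keywords_clean_words` and the joined string plays the role of `keywords_clean`.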

In [5]:
ea_df.head()

Unnamed: 0,Causes/ Columns,Keywords_Set 1,Keywords_Set 2,Yad's comments,keywords_clean_words,keywords_clean
0,Health infectious diseases,"HIV, AIDs, Tuberculosis, Clinic, Hepatitis, De...","HIV, AIDs, Tuberculosis, Hepatitis, Dengue, Ma...",,"[hiv, aids, tuberculosis, clinic, hepatitis, d...",hiv aids tuberculosis clinic hepatitis dengue ...
1,Neglected tropical diseases (NTDs),"Deworming, parasitic worms, neglected tropical...","Deworming, parasitic worms, neglected tropical...",,"[deworming, parasitic, worms, neglected, tropi...",deworming parasitic worms neglected tropical d...
2,Social Enterprise,"Social Entrepreneur, business, Entrepreneurshi...","Social Entrepreneur, Entrepreneurship",,"[social, entrepreneur, business, entrepreneurs...",social entrepreneur business entrepreneurship ...
3,Environment,"Recycle, Water, plastic, nature, fishery, farm...","Recycle, plastic, pollution, natural resources...",,"[recycle, water, plastic, nature, fishery, far...",recycle water plastic nature fishery farming p...
4,Disaster relief,"Flood, natural disaster, cyclones, earthquakes...","Flood, natural disaster, cyclones, earthquakes...",,"[flood, natural, disaster, cyclones, earthquak...",flood natural disaster cyclones earthquakes re...


Q: what are all the cause areas?

In [6]:
ea_df['Causes/ Columns']

0                  Health infectious diseases
1          Neglected tropical diseases (NTDs)
2                           Social Enterprise
3                                 Environment
4                             Disaster relief
5                         Housing and shelter
6                                       Clubs
7                               Special needs
8                     Health non communicable
9                             Family planning
10               Neonatal and maternal health
11                      Physical disabilities
12                                  Education
13                   Research and development
14                 Information and technology
15                    Energy & infrastructure
16                          Visual impairment
17                                    Elderly
18                                  Religious
19            Early childhood and development
20                          Children & Youths
21                             Ani

## create features: count vectors for each cause area

In [7]:
count_vectorizer = CountVectorizer(binary=False)

data_feat = count_vectorizer.fit_transform(ea_df['keywords_clean'])
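
To make the count-vector features concrete, here is a small sketch on toy strings. The two documents below are made-up stand-ins for rows of `ea_df['keywords_clean']`, not the real keyword sets. With `binary=False`, a repeated word is counted each time it occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy stand-ins for ea_df['keywords_clean'] (assumption, not the real data)
docs = [
    "hiv aids tuberculosis clinic",
    "flood natural disaster flood relief",
]

vectorizer = CountVectorizer(binary=False)
features = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# column order of the matrix: vocabulary words sorted by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = features.toarray()

print(vocab)
print(dense)
```

Each column corresponds to one vocabulary word, so the row for the second document holds a 2 in the `flood` column.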

## create labels: cause areas

In [8]:
lb = LabelBinarizer()
data_label = list(ea_df['Causes/ Columns']) 
data_label_transform = lb.fit_transform(data_label)
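
As a quick illustration of what `LabelBinarizer` produces, the sketch below encodes three cause area names as one-hot rows and maps them back. The variable names `lb_demo`, `labels`, and `onehot` are introduced only for this example.

```python
from sklearn.preprocessing import LabelBinarizer

# a small subset of cause area names, used only for illustration
labels = ["Environment", "Disaster relief", "Education"]

lb_demo = LabelBinarizer()
onehot = lb_demo.fit_transform(labels)  # one row per label, one column per class

# classes_ gives the column order of the one-hot matrix
print(lb_demo.classes_)
print(onehot)

# inverse_transform recovers the original names from the one-hot rows
restored = list(lb_demo.inverse_transform(onehot))
print(restored)
```

The round trip through `inverse_transform` is what lets the classifier's one-hot predictions be compared against the original cause area names.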

## classify and predict (basic)
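- since the classifier's performance varies with random state, fit it with multiple random states and report the mean accuracy
- note: accuracy here is computed on the same data the tree was fit on (training accuracy)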

In [9]:
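def do_classify_dt(random_state=None):
    # fit a decision tree on the count vectors and one-hot cause area labels
    base_clf = tree.DecisionTreeClassifier(random_state=random_state)
    base_clf.fit(data_feat, data_label_transform)

    # predict on the training data and map one-hot rows back to cause area names
    train_predict_feat = base_clf.predict(data_feat)
    train_predict = lb.inverse_transform(train_predict_feat)

    # training accuracy: share of cause areas recovered exactly
    acc = accuracy_score(data_label, train_predict)

    return (base_clf, acc)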

In [10]:
base_clf_random_state = range(50)
base_clf_experiments = [do_classify_dt(i) for i in base_clf_random_state]

base_clf_avg_acc = np.average([acc for (clf,acc) in base_clf_experiments])
print('base classifier mean accuracy: %f' % base_clf_avg_acc)

base classifier mean accuracy: 1.000000
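
A mean accuracy of 1.0 is what one would expect here: the tree is scored on the rows it was fit on, and an unrestricted decision tree can keep splitting until every distinct row sits in its own leaf. The sketch below reproduces this on toy data and prints the learned rules, which relates to the interpretability goal. It rests on several assumptions: the documents and labels are made up, and the tree is fit on string labels directly rather than on the one-hot labels used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

# toy stand-ins for the cleaned keyword sets and cause areas (assumption)
docs = [
    "hiv aids tuberculosis clinic",
    "flood natural disaster relief",
    "recycle plastic pollution nature",
]
labels = ["Health", "Disaster relief", "Environment"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# assumption: fit on string labels directly instead of one-hot labels
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# scored on the training rows themselves, as in the cells above
train_acc = accuracy_score(labels, clf.predict(X))
print("training accuracy:", train_acc)

# print the split rules with keyword names instead of column indices
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(export_text(clf, feature_names=feature_names))
```

Because each cause area contributes exactly one training row, this score says the keyword sets are separable, not how the model would behave on unseen text.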
