# EA x DKSG
**classification: ea**

classify cause area keywords -> cause areas
- expert knowledge & elbow grease from EA team

## methodology
- clean up cause keywords
    - drop stopwords
    - drop punctuation
- create count vectors for each cause area keyword set
- use count vectors as feature for classification with
    - decision tree
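
The cleanup step above can be sketched in plain Python. This is a minimal illustration, not the notebook's `get_cleaned_descriptions` helper (whose implementation is not shown here); the `STOPWORDS` set and the `clean_keywords` function are hypothetical names introduced for this example.

```python
import string

# Hypothetical stopword list for illustration only; the notebook's helper
# may use a different (likely larger) list.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to"}

def clean_keywords(raw):
    """Lowercase a keyword string, strip punctuation, and drop stopwords."""
    # Map every punctuation character to a space so "HIV, AIDs" splits cleanly.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = raw.lower().translate(table).split()
    return [w for w in words if w not in STOPWORDS]

words = clean_keywords("HIV, AIDs, Tuberculosis, Clinic")
print(words)            # ['hiv', 'aids', 'tuberculosis', 'clinic']
print(" ".join(words))  # hiv aids tuberculosis clinic
```

The joined string is the form that a count vectorizer expects as input: one whitespace-separated document per cause area.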
    
## hypothesis
given a cause area's list of keywords, I should be able to distinguish each cause area uniquely.

## caveats
since we only have one entry per cause area, the objective instead is to try to have a classifier that 

In [87]:
## setup
%run env_setup.py
%run filepaths.py
%run helpers.py

## clean up cause area keywords

In [124]:
## ml setup
import sklearn
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn import tree
from sklearn import ensemble
from sklearn.metrics import accuracy_score

In [89]:
ea_df = read_from_csv(EA_CSV)

In [90]:
ea_df['keywords_clean_words'] = get_cleaned_descriptions(list(ea_df[KEYWORDS_COLUMN]), True, True, False)
ea_df['keywords_clean'] = get_sentence_from_list(list(ea_df['keywords_clean_words']))

In [91]:
ea_df.head()

| Unnamed: 0 | Causes/ Columns | Keywords_Set 1 | Keywords_Set 2 | Yad's comments | keywords_clean_words | keywords_clean |
|---|---|---|---|---|---|---|
| 0 | Health infectious diseases | HIV, AIDs, Tuberculosis, Clinic, Hepatitis, De... | HIV, AIDs, Tuberculosis, Hepatitis, Dengue, Ma... | | [hiv, aids, tuberculosis, clinic, hepatitis, d... | hiv aids tuberculosis clinic hepatitis dengue ... |
| 1 | Neglected tropical diseases (NTDs) | Deworming, parasitic worms, neglected tropical... | Deworming, parasitic worms, neglected tropical... | | [deworming, parasitic, worms, neglected, tropi... | deworming parasitic worms neglected tropical d... |
| 2 | Social Enterprise | Social Entrepreneur, business, Entrepreneurshi... | Social Entrepreneur, Entrepreneurship | | [social, entrepreneur, business, entrepreneurs... | social entrepreneur business entrepreneurship ... |
| 3 | Environment | Recycle, Water, plastic, nature, fishery, farm... | Recycle, plastic, pollution, natural resources... | | [recycle, water, plastic, nature, fishery, far... | recycle water plastic nature fishery farming p... |
| 4 | Disaster relief | Flood, natural disaster, cyclones, earthquakes... | Flood, natural disaster, cyclones, earthquakes... | | [flood, natural, disaster, cyclones, earthquak... | flood natural disaster cyclones earthquakes re... |


Q: what are all the cause areas?

In [92]:
ea_df['Causes/ Columns']

0                  Health infectious diseases
1          Neglected tropical diseases (NTDs)
2                           Social Enterprise
3                                 Environment
4                             Disaster relief
5                         Housing and shelter
6                                       Clubs
7                               Special needs
8                     Health non communicable
9                             Family planning
10               Neonatal and maternal health
11                      Physical disabilities
12                                  Education
13                   Research and development
14                 Information and technology
15                    Energy & infrastructure
16                          Visual impairment
17                                    Elderly
18                                  Religious
19            Early childhood and development
20                          Children & Youths
21                             Ani

## create features: count vectors for each cause area

In [139]:
count_vectorizer = CountVectorizer(binary=False)

data_feat = count_vectorizer.fit_transform(ea_df['keywords_clean'])
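
To make the feature matrix concrete, here is a plain-Python sketch of what a count vector is: one column per vocabulary term, one row per document, each cell holding the number of times the term occurs. It is an illustration under simplifying assumptions, not scikit-learn's implementation (for example, `CountVectorizer`'s default tokenizer also drops single-character tokens). The third document is a hypothetical entry with a repeated term.

```python
from collections import Counter

docs = [
    "hiv aids tuberculosis clinic",
    "deworming parasitic worms neglected tropical diseases",
    "flood natural disaster natural",  # hypothetical entry with a repeated term
]

# Vocabulary: every distinct term, sorted, mapped to a column index.
terms = sorted({word for doc in docs for word in doc.split()})
vocabulary = {term: index for index, term in enumerate(terms)}

def count_vector(doc):
    """Return one row of term counts, aligned with the vocabulary columns."""
    counts = Counter(doc.split())
    row = [0] * len(vocabulary)
    for term, n in counts.items():
        row[vocabulary[term]] = n
    return row

matrix = [count_vector(doc) for doc in docs]
print(len(vocabulary))                    # number of columns
print(matrix[2][vocabulary["natural"]])   # count of "natural" in the third row
```

With `binary=False`, as in the cell above, a repeated term contributes its full count; with `binary=True` every non-zero count would be clipped to 1.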

## create labels: cause areas

In [140]:
lb = LabelBinarizer()
data_label = list(ea_df['Causes/ Columns']) 
data_label_transform = lb.fit_transform(data_label)
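
As a rough illustration of the label encoding, the sketch below one-hot encodes a few cause areas and decodes them again in plain Python. The `binarize` and `inverse` functions are hypothetical stand-ins, not scikit-learn's API; the only behaviour assumed from `LabelBinarizer` is that it keeps its classes in sorted order.

```python
labels = ["Environment", "Disaster relief", "Education"]

# Classes in sorted order, one output column per class.
classes = sorted(set(labels))

def binarize(label):
    """One-hot row: 1 in the column of the matching class, 0 elsewhere."""
    return [1 if label == c else 0 for c in classes]

def inverse(row):
    """Decode a row by taking the column with the largest value."""
    return classes[row.index(max(row))]

encoded = [binarize(label) for label in labels]
decoded = [inverse(row) for row in encoded]
print(classes)
print(encoded[0])
print(decoded == labels)
```

One consequence of decoding by the largest value: a row of all zeros decodes to the first class, so a prediction with no column set is silently mapped to a label rather than flagged.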

## classify and predict (basic)
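- the decision tree's result varies with `random_state`, so fit it under multiple seeds and report the mean accuracy
- accuracy here is measured on the same rows the tree was fit on, so it is training accuracy, not held-out performance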

In [141]:
def do_classify_dt(random_state=None):
    """Fit a decision tree on all rows and return (classifier, training accuracy)."""
    base_clf = tree.DecisionTreeClassifier(random_state=random_state)
    base_clf.fit(data_feat, data_label_transform)
    
    # predict on the training rows, then map one-hot rows back to cause area names
    train_predict_feat = base_clf.predict(data_feat)
    train_predict = lb.inverse_transform(train_predict_feat)
    
    acc = accuracy_score(data_label, train_predict)
    
    return (base_clf, acc)

In [143]:
base_clf_random_state = range(50)
base_clf_experiments = [do_classify_dt(i) for i in base_clf_random_state]

base_clf_avg_acc = np.average([acc for (clf,acc) in base_clf_experiments])
print('base classifier mean accuracy: %f' % base_clf_avg_acc)

base classifier mean accuracy: 0.966500
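
For reference, the arithmetic behind the mean accuracy can be sketched without scikit-learn: accuracy is the fraction of predictions that match the true label, and the reported figure is the plain average over runs. The labels and predictions below are hypothetical and only illustrate the calculation; they are not the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical true labels and predictions from three runs (three seeds).
y_true = ["Education", "Elderly", "Religious", "Clubs"]
runs = [
    ["Education", "Elderly", "Religious", "Clubs"],
    ["Education", "Elderly", "Religious", "Elderly"],
    ["Education", "Elderly", "Religious", "Clubs"],
]

per_run = [accuracy(y_true, predictions) for predictions in runs]
mean_acc = sum(per_run) / len(per_run)
print(per_run)
print("mean accuracy: %f" % mean_acc)
```

Because the predictions come from the rows the model was trained on, a mean below 1.0 means some training rows are not recovered under some seeds; it does not say how the classifier would do on unseen text.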


## feature importances

In [151]:
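import eli5
# pass the fitted vectorizer via `vec` so importances are labelled with vocabulary terms.
# the original call used `features=lb.classes_`, which raised:
# TypeError: explain_rf_feature_importance() got an unexpected keyword argument 'features'
eli5.explain_weights(base_clf_experiments[0][0], vec=count_vectorizer, top=10)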
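
Independently of `eli5`, a fitted scikit-learn tree exposes a `feature_importances_` array indexed by feature column, and `count_vectorizer.vocabulary_` maps each term to its column. A minimal sketch of pairing the two is shown below; the vocabulary subset and importance values are hypothetical, and `top_features` is a name introduced only for this example.

```python
# Hypothetical stand-ins for count_vectorizer.vocabulary_ (term -> column index)
# and for a fitted tree's feature_importances_ (one weight per column).
vocabulary = {"aids": 0, "clinic": 1, "flood": 2, "hiv": 3}
importances = [0.10, 0.0, 0.55, 0.35]

# Invert the mapping so a column index can be looked up as a term.
index_to_term = {index: term for term, index in vocabulary.items()}

def top_features(importances, k):
    """Return up to k (term, importance) pairs with non-zero importance, largest first."""
    ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
    return [(index_to_term[i], w) for i, w in ranked[:k] if w > 0]

print(top_features(importances, 3))
```

Note that the names attached to importances must be vocabulary terms (the feature columns), not `lb.classes_`, which holds the cause area labels.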

In [154]:
count_vectorizer.vocabulary_

{'000': 0,
 '12': 1,
 'abandon': 2,
 'abandoned': 3,
 'abandonment': 4,
 'ability': 5,
 'abortion': 6,
 'absenteeism': 7,
 'absolute': 8,
 'abuction': 9,
 'abuse': 10,
 'abused': 11,
 'abusive': 12,
 'academic': 13,
 'accident': 14,
 'acid': 15,
 'activist': 16,
 'acute': 17,
 'addiction': 18,
 'adhd': 19,
 'administration': 20,
 'admission': 21,
 'adolescent': 22,
 'adoption': 23,
 'advertisement': 24,
 'advocacy': 25,
 'african': 26,
 'age': 27,
 'agencies': 28,
 'aggression': 29,
 'agricultural': 30,
 'agriculture': 31,
 'aid': 32,
 'aids': 33,
 'ailment': 34,
 'air': 35,
 'albendazole': 36,
 'alcohol': 37,
 'alms': 38,
 'almshouse': 39,
 'alzheimer': 40,
 'amnesty': 41,
 'anglican': 42,
 'animal': 43,
 'animals': 44,
 'anthropology': 45,
 'anti': 46,
 'architect': 47,
 'armed': 48,
 'art': 49,
 'artillery': 50,
 'arts': 51,
 'ascariasis': 52,
 'ashram': 53,
 'aspergerâ': 54,
 'assault': 55,
 'assessment': 56,
 'assistance': 57,
 'association': 58,
 'asylum': 59,
 'attendance': 60,
