## Project name: Flag Database Fun Time

### Team Leader: Cole Polychronis

### Team Members: Merritt Ruthrauff, Kelsey Henrichsen, Jasmine Boonyakiti

#### Data Description:

Our dataset has 194 records. Each record is described by the following values:

**Response Variable:**
* Religion of a Country      

**Predictor Variables:**
1. Name of country   
2. Continent of country                  
3. Quadrant of the world (relative to Greenwich and the Equator)   
4. Area (in thousands of square kilometers)
5. Population (in millions) 
6. Language spoken  
7. Number of vertical bars on flag              
8. Number of horizontal stripes on flag         
9. Number of colors on flag    
10. Presence of red on flag
11. Presence of green on flag
12. Presence of blue on flag
13. Presence of gold on flag
14. Presence of white on flag
15. Presence of black on flag
16. Presence of orange on flag
17. Main color on flag
18. Number of circles on flag
19. Number of crosses on flag
20. Number of saltires (diagonal crosses) on flag
21. Number of quartered sections
22. Number of suns or stars on flag
23. Presence of crescent on flag
24. Presence of triangles on flag
25. Presence of inanimate icon on flag
26. Presence of animate icon on flag
27. Presence of text on flag
28. Color in top left of flag
29. Color in bottom right of flag

# Step 1.1: Load Data

#### Because the data could not be downloaded directly from the UCI site, we wrote a very basic web crawler to pull the dataset off the UCI page. Splitting the page text on commas gave us a list in which the last value of one record was joined to the first value of the next by \n, so we built a function to separate these joined values. We then grouped the values into rows of 30 and converted them into a DataFrame.
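
To make the joined-value problem concrete, here is a minimal sketch on a made-up two-record string with three fields per record (the real file has 30). The names `raw_text` and `fields_per_row` are hypothetical, and the sketch uses `str.split` rather than the index-based function in the cell below. It also assumes that no field is legitimately empty.

```python
# Toy text with 3 fields per record; the values are made up for illustration.
raw_text = "Alpha,5,green\nBeta,3,red\n"
fields_per_row = 3

# Splitting on commas alone joins the last field of one record
# to the first field of the next record.
pieces = raw_text.split(',')

# Split each piece on the newline to pull the joined values apart.
flat = []
for piece in pieces:
    flat.extend(piece.split('\n'))

# Drop the empty piece left by the trailing newline, then group into rows.
flat = [value for value in flat if value != '']
rows = [flat[i:i + fields_per_row] for i in range(0, len(flat), fields_per_row)]
print(rows)
```

Dropping the empty trailing piece before grouping avoids the all-None buffer row that otherwise has to be removed afterwards.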

In [None]:
# basic webcrawler to peel data off of UCI webpage
from lxml import html
import requests
page = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data')
tree = html.fromstring(page.text)
info = tree.xpath('//text()')
info = info[0].split(',')

# method to untangle values connected by \n (EX: 'green\nAlbania')
def untangle(arr):
    untangled = []
    for el in arr:
        try: 
            ind = el.index('\n')
        except ValueError:
            ind = -1
        if ind == -1:
            untangled.append(el)
        else:
            untangled.append(el[:ind])
            untangled.append(el[ind+1:])
    return untangled

# group data into rows and convert to a Dataframe
untangled = untangle(info)
usable = [untangled[i:i + 30] for i in range(0, len(untangled), 30)]
import pandas as pd
df = pd.DataFrame(usable)

# remove last buffer row, which contains all None
df = df[:-1]
# convert all columns that should be numeric to ints (they are currently all strings);
# columns 0 (name), 17 (main color), 28 and 29 (corner colors) stay as strings
indices = [i for i in range(30) if i not in (0, 17, 28, 29)]
df[indices] = df[indices].apply(pd.to_numeric)

# Step 1.2: Preprocessing
#### We convert the categorical variables to dummy variables so that they can be used by DecisionTreeClassifier, and we add column names so that the graphviz output of our tree is easier to read.
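
As a small illustration of what `pd.get_dummies` does to the column layout, the sketch below encodes one column of a made-up frame. The frame `toy` and its values are hypothetical; only the `prefix` and `columns` arguments mirror the call in the next cell.

```python
import pandas as pd

# Toy frame with integer column labels, like the unnamed columns of df.
toy = pd.DataFrame({0: ['A', 'B', 'C'], 1: [1, 2, 1], 2: [10, 20, 30]})

# Encode only column 1; the prefix replaces the column label in the new names.
encoded = pd.get_dummies(toy, prefix=['continent'], columns=[1])

print(list(encoded.columns))
```

The encoded column disappears from its original position and its dummies are appended at the end, so a positional index into the columns no longer matches the original column number after this step.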

In [None]:
# convert categorical variables to "dummy" variables so that they can be used by the DecisionTreeClassifier
df = pd.get_dummies(df, prefix=['continent', 'quadrant', 'language','mainColor','topLeftColor','bottomRightColor'], columns=[1, 2, 5, 17, 28, 29])

In [None]:
# name the columns so that the graphviz output of the tree is readable
# (original column order, before the dummies were created, kept for reference)
# df.columns=['country','continent','quadrant','area (thousands of square km)','population (millions)','language','religion','bars','stripes','numOfColors','red','green','blue','gold','white','black','orange','mainColor','numOfCircles','numOfCrosses','numOfSaltires','numOfQuarters','numOfSunStars','crescent','triangle','icon','animate','text','topLeftColor','botRightColor']

df.columns=['country','area (thousands of square km)', 'population (millions)','religion','bars','stripes','numOfColors','red','green','blue','gold','white','black','orange','numOfCircles','numOfCrosses','numOfSaltires','numOfQuarters','numOfSunStars','crescent','triangle','icon','animate','text','inN.America','inS.America','inEurope','inAfrica','inAsia','inOceania','inNE','inSE','inSW','inNW','english','spanish','french','german','slavic','otherIndoEuropean','chinese','arabic','Japanese/Turkish/Finnish/Magyar','other',
            'mainColor_black','mainColor_blue','mainColor_brown','mainColor_gold','mainColor_green','mainColor_orange','mainColor_red','mainColor_white','topLeftColor_black','topLeftColor_blue','topLeftColor_gold','topLeftColor_green','topLeftColor_orange','topLeftColor_red','topLeftColor_white','bottomRightColor_black',
            'bottomRightColor_blue','bottomRightColor_brown',
            'bottomRightColor_gold','bottomRightColor_green',
            'bottomRightColor_orange','bottomRightColor_red',
            'bottomRightColor_white']

# select data and target sets by label rather than by position: after get_dummies
# the positions have shifted, so position 6 is numOfColors, not religion
data = df.drop(columns=['country', 'religion'])
target = df['religion']
df.head()


# Step 1.3: Modeling using Decision Tree
#### From the feature importances printed below, population size looks like the most important feature for determining the religion of a country.
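
`feature_importances_` comes back as a bare array in column order, so pairing it with the column names makes the ranking easier to read. The following is a minimal sketch on made-up data in which the label depends only on a hypothetical `population` column; `X`, `y`, `toy_clf` and `ranked` are illustrative names, not part of our notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label depends only on 'population', so it should dominate.
X = pd.DataFrame({
    'population': [1, 2, 3, 50, 60, 70],
    'red':        [0, 1, 0, 1, 0, 1],
})
y = [0, 0, 0, 1, 1, 1]

toy_clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(toy_clf.feature_importances_, index=X.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

The same pattern should apply to the real model if the index is set to the columns of the frame the tree was fitted on.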

In [None]:
# create Decision Tree Classifier and determine which features are most important
from sklearn import tree
import numpy as np
clf = tree.DecisionTreeClassifier()
clf.fit(data,target)

# classification rate on the training data itself (not on a held-out test set)
y_pred = clf.predict(data)
classif_rate = np.mean(y_pred == target.to_numpy()) * 100
print("classif_rate for %s : %f " % ('DecisionTreeClassifier', classif_rate))
print(clf.feature_importances_)

# Step 1.4: Visualization
#### For dummy variables like inS.America, a split on <= 0.5 means the value was 0 (not true), while a value > 0.5 means the value was 1 (true).
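
The 0.5 threshold can be reproduced on a tiny made-up example: with a single 0/1 column, the tree places its split halfway between the two observed values. The data and the name `toy_clf` below are hypothetical; `export_text` is used only to print the tree without graphviz.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: one 0/1 dummy column that fully determines the class.
X = [[0], [0], [1], [1]]
y = ['no', 'no', 'yes', 'yes']

toy_clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The split threshold is the midpoint between the two observed values, 0 and 1.
threshold = toy_clf.tree_.threshold[0]
print(threshold)

# Text rendering of the tree, with the dummy column named as in our data.
print(export_text(toy_clf, feature_names=['inS.America']))
```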

In [None]:
# visualize the decision tree
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None,
#                                max_depth=5,
                               filled=True, rounded=True,
                               # names must line up with the columns the tree was fitted on
                               feature_names=[str(c) for c in data.columns],
                               class_names=['Catholic','Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Other']) 
graph = graphviz.Source(dot_data)  
graph