In [7]:
# the following line is jupyter notebook specific 
%matplotlib inline

# Python Dataset Assigment 2

## The Dataset
The dataset is a file containing data about homocides in the period from 1980-2014 with data about the victim along with his case, including gender, age, ethnicity, perpetrator and so on.
## The Task
1. Which ethnicity is it most common for the victims and perpetrators to be?
2. Which weapon is most used by men?
3. Which weapon is most used by women?
4. What is the age of the youngest victim and the oldest victim?
5. Average age of victims?
6. Male to female ratio of perpetrators?
7. Top 10 states with most homicides? display it with bars (barchart) or similar

Optionals:
8. ~~Are younger perpetrators (age 15-25) more likely to get caught then older ones (25+)?~~

We have done all the above except for the ones that are striked.


## Question 1
**Which ethnicity is it most common for the victims and perpetrators to be?**

We used dictionaries for this. We import the defaultdict from collections because it lets us set a default value for unknown entries. We load the CSV file, create dictionaries, and insert into them while iterating.

We first tried iterating with *iterrows*, but it was **orders of magnitude slower** than *itertuples*.

The most popular victim and perpetrator ethnicity seems to be **Unknown**. We thought this was an error at first, but if we print out the dictionaries that we counted, we can see that it is not an error.

In [12]:
from collections import defaultdict
import pandas as pd

data = pd.read_csv("./database.csv", dtype={"Perpetrator Age": object})

victims = defaultdict(lambda: 0)
perpetrators = defaultdict(lambda: 0)

for row in data.itertuples():       # itertuples gave order of magnitude improvement over iterrows
    victims[row[15]] += 1
    perpetrators[row[19]] += 1
    
eth_victim = max(victims, key=victims.get)
eth_perpetrator = max(perpetrators, key=perpetrators.get)
print("Most popular victim ethnicity: " + eth_victim)
print(dict(victims))
print("")
print("Most popular perpetrator ethnicity: " + eth_perpetrator)
print(dict(perpetrators))

Most popular victim ethnicity: Unknown
{'Hispanic': 72652, 'Unknown': 368303, 'Not Hispanic': 197499}

Most popular perpetrator ethnicity: Unknown
{'Hispanic': 46872, 'Unknown': 446410, 'Not Hispanic': 145172}


## Question 2
Which weapon is most used by men?

Same as before, we make a simple counting dictionary.

In [13]:
male_weapons = defaultdict(lambda: 0)
for row in data.itertuples():
    if(row[16] == "Male"):
        male_weapons[row[21]] += 1
top_male_weapon = max(male_weapons, key=male_weapons.get)
print("Most popular male weapon: " + top_male_weapon)

Most popular male weapon: Handgun


## Question 3
Which weapon is most used by women?

First we import pandas as pd and numpy as np.
Then we read the csv file and set Perpetrator Ages dtype as object since it has a data mismatch.

we set the csv file as a specialized 2d-array with matrix() and call it dd.

we specifically take Female from column 15 with np.unique and counts it with the weapon column(20) and then print out the  weapon used the mostb with np.argmax.

In [1]:
import pandas as pd
import numpy as np
# import webget

homicides = pd.read_csv("database.csv", dtype={"Perpetrator Age": object})

dd = homicides.as_matrix()

# Weapon = 20
# Perpetrator sex = 15

weapons, count = np.unique(dd[(dd[:,15] == "Female")][:,20], return_counts=True)
print("Most used weapon by females: " + weapons[np.argmax(count)])

Most used weapon by females: Handgun


## Question 4
What is the age of the youngest victim and the oldest victim?

This was pretty simple since we just had to use column 12, so we uniquely specified column 12 two times to get both the max value and the min value from the column.

In [2]:
#What is the age of the youngest victim and the oldest victim?

young_victim = np.unique(dd[:,12])
    
old_victim = np.unique(dd[:,12])

print("Youngest victim: ", np.argmin(young_victim), "years old")
print("Oldest victim: ", np.argmax(old_victim), "years old")

Youngest victim:  0 years old
Oldest victim:  100 years old


## Question 5
Average age of victims?

In order to find the average age of the victims, we created 2 variables: 1 for the sum of all ages and 1 for the count of them.
When we've got these 2 numbers, it's fairly simple to just divide the sum with the count, and thereby have the average


In [5]:
#data = read_from_csv("./database.csv")


def ex5_average_victim_age(data):
    age_sum = 0
    counting = 0
    
    for row in data.iterrows():
       if(row[13] == "Victim Age"):
                pass # ignore headers
       else:
          age_sum += int(row[13])
          counting += 1
    average_victim_age = float(age_sum / counting)
    
    print("The average victim age is " + average_victim_age + ".")

## Question 6
Male to female ratio of perpetrators?

Much like the previous question; we instantiate some variables. As we iterate over the entire row that holds "Victim Sex", we check their gender and count up for their respective variable. When we've aquired these 2 numbers, then it's only a matter of finding the percentage.  

In [None]:
def ex6_gender_ratio_attackers(data):
    
    male_count = 0
    female_count = 0
    unknown_count = 0
    total_count = 0
    
    for row in data.iterrows():
        if (row[12] == "Victim Sex):
            pass #ignore headers
        elif(row[12] == "Male"):
            male_count += 1
            total_count += 1
        elif(row[12] == "Female"):
            female_count += 1
            total_count += 1
     
    male_ratio = (male_count / total_count) * 100      
    female_ratio = (female_count / total_count) * 100
            
    print("Males made up: " + male_ratio + "% /n," + 
          "females made up: " + female_ratio" + %.")
    

## Question 7
Top 10 states with most homicides? display it with bars (barchart) or similar 

First we need to take care of the importing of the Python modules that we are going to be using, they are as following:

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import heapq

We will be using [pandas](http://pandas.pydata.org/) to read all the data from a csv file and prepare it for data handling as a dataframe object. Next up is [pyplot](http://matplotlib.org/users/pyplot_tutorial.html), which is going to be used to present the data as a bar chart. Next, heapq, which is a module that gives us access to things like the [heapsort](https://en.wikipedia.org/wiki/Heapsort) algorithm. 

The second step is to download the dataset from a target url. We simulate that in this case, because the dataset had to retrieved manually:

In [10]:
def download(url):
    return "./database.csv"

The third step is to read the dataset and prepare it for data handling, we use pandas here

In [11]:
def read_from_csv(filename):
    data = pd.read_csv(filename)
    return data

Now that we have a dataframe object, called data, we can start applying data handling to get the results we want. Remember we are interested in the top 10 states by homicide rates!
First of all we define a new function as ex7_state_most_homicides, which takes the data as parameter and something we have invented called a limiter. The limiter is used if you only want to take so many rows from the huge data set, for testing purposes.

In [11]:
def ex7_states_most_homicides(data, limiter):
    if limiter > len(data):
        raise ValueError('Limiter value higher than data in dataset!')
    states = {}
    #for idx, row in data.iterrows():
        #if row["State"] in states:
            #states[row["State"]] += 1
        #else:
            #states.setdefault(row["State"], 0)
        #if idx > limiter:
            #break
    top_10_states = {}
    #Return the top 6 elements in our list. nlargest takes n = elements, iterable = Publishers and opt key which we need here 
    #list_of_states = heapq.nlargest(10, states, key=states.get) 
    #for state in list_of_states:
        #top_10_states[state] = states.get(state)
    return top_10_states

In the above code pay attention to all lines that are not comments (has # infront and is not grey). We use our limiter to determine if it is set to a value bigger than the total rows of the data. If that is the case, we have someone who is trying to use our method incorrectly. We throw an error telling him that.
Then we prepare 2 dictionaries, states and top_10_states and we return the latter

Now we are ready to fetch the data we need:

In [12]:
def ex7_states_most_homicides(data, limiter):
    if limiter > len(data):
        raise ValueError('Limiter value higher than data in dataset!')
    states = {}
    for idx, row in data.iterrows():
        if row["State"] in states:
            states[row["State"]] += 1
        else:
            states.setdefault(row["State"], 1)
        if idx > limiter:
            break
    top_10_states = {}
    #Return the top 6 elements in our list. nlargest takes n = elements, iterable = Publishers and opt key which we need here 
    #list_of_states = heapq.nlargest(10, states, key=states.get) 
    #for state in list_of_states:
        #top_10_states[state] = states.get(state)
    return top_10_states

For every index and row in our data, we ask if the name of the row we are on exists in the dictionary we prepared "states", if that is the case, we do not need to add a new entry, but simply count the number of entries by saying, hey, here is an occurence so let's add 1! 
If it does not exist (else) we create a new entry and set it to a default of 1.

For the final step we do this:

In [14]:
def ex7_states_most_homicides(data, limiter):
    if limiter > len(data):
        raise ValueError('Limiter value higher than data in dataset!')
    states = {}
    for idx, row in data.iterrows():
        if row["State"] in states:
            states[row["State"]] += 1
        else:
            states.setdefault(row["State"], 1)
        if idx > limiter:
            break
    top_10_states = {}
    #Return the top 6 elements in our list. nlargest takes n = elements, iterable = Publishers and opt key which we need here 
    list_of_states = heapq.nlargest(10, states, key=states.get) 
    for state in list_of_states:
        top_10_states[state] = states.get(state)
    return top_10_states

We use the module we imported heapq, to find the n largest elements in our dictionairy and return the names instead of the values. We now get a dictionary of the top 10 states by name.
Next, we need to apply a value to the dict keys so we iterate over every state and assign its value

Finally, return the dictionairy

The final step is to plot our data that we handled:

In [15]:
def ex7_plot(tt_values, tt_labels):
    plt.bar(range(len(tt_values)), tt_values, width=0.5, linewidth=0, align='center') # !IMPORTANT
    plt.xticks(range(len(tt_values)), tt_labels, size='small')
    title = 'Top 10 states with highest recorded homocide rates'
    plt.title(title, fontsize=20)
    plt.xlabel("State", fontsize=12)
    plt.ylabel("No. of homocide cases", fontsize=12)
    plt.tick_params(axis='x', which='major', labelsize=5)
    plt.show()

We define a function that will takes labels and values. It will then bar plot those values. The important line is the first one, plt.bar .... where we specify our valies for the x and y axis

So how do we use our functions? Like this:

In [None]:
filename = download("dummy_url")
data = read_from_csv(filename)
top_ten_states = ex7_states_most_homicides(data, len(data)) # !IMPORTANT
tt_labels = list(top_ten_states.keys()) # !IMPORTANT
tt_values = list(top_ten_states.values()) # !IMPORTANT
ex7_plot(tt_values, tt_labels)

We need to prepare the data dictionary as a list, so we can fetch its keys and values separately and send them to our plotter function

## The Result

![alt text](https://github.com/DaMexicanJustice/frantic_midnight/blob/master/handin2/ex7_plot.png?raw=true "Logo Title Text 1")