# Project 2
## Matthew Manno's Data

I will use this data to figure out geographically where my users are. I want to know where they are located by city, and if city is unavailable then by the state they live in.

I want to know:

    1. What city has the largest number of female users?
    2. What city has the largest number of male users?
    3. What state has the largest number of female users?
    4. What state has the largest number of male users?

In [34]:
import numpy as np
import pandas as pd
# set some pandas options controlling output format
pd.set_option('display.notebook_repr_html', True) # render dataframes as HTML tables in the notebook
pd.set_option('display.max_rows', None) # no limit on the number of rows displayed
pd.set_option('display.max_columns', None) # no limit on the number of columns displayed

In [35]:
# read the csv into a dataframe, and force postal_code to be interpreted as a string.
users = pd.read_csv('data/IS-362_Untidy_Data.csv', dtype={'postal_code': str})

In [36]:
# Display the columns in the dataframe, and get a peek at how it's formatted
users.head()

Unnamed: 0,id,first_name,last_name,gender,email,address,city,state,postal_code,phone_number
0,1,Loralee,MacAdam,,lmacadam0@twitter.com,0122 Alpine Street,Fort Wayne,Indiana,46814.0,260-134-3027
1,2,Shir,Padula,Female,spadula1@state.gov,7 Orin Lane,Littleton,Colorado,,720-172-5365
2,3,Nicole,Ottey,Female,nottey2@ucsd.edu,04558 Del Mar Street,New York City,New York,10175.0,646-968-2745
3,4,Wandie,MacKinnon,Female,wmackinnon3@geocities.jp,4 Morningstar Alley,Tucson,Arizona,85743.0,520-484-3593
4,5,,Avey,Male,tavey4@google.com.br,40574 Mallard Crossing,Canton,Ohio,44720.0,234-345-6985
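Forcing postal_code to be read as a string matters because a postal code can start with a zero, which a numeric column would silently drop. Here is a minimal sketch of that effect, using a made-up two-row CSV (the values are hypothetical and not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical two-row CSV; these values are made up for illustration.
csv_text = "id,postal_code\n1,02134\n2,46814\n"

# Default parsing treats the column as a number and loses the leading zero.
as_default = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the text exactly as written in the file.
as_string = pd.read_csv(io.StringIO(csv_text), dtype={'postal_code': str})

default_first = as_default.postal_code.iloc[0]
string_first = as_string.postal_code.iloc[0]
print(default_first, string_first)
```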


In [37]:
# get rid of rows that are missing all three of city, state, and postal code
users.dropna(subset=['city', 'state', 'postal_code'], how='all', inplace=True)
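The `how` argument decides how strict the drop is: `how='all'` removes a row only when every listed column is missing, while `how='any'` removes it as soon as one is missing. A small sketch on a hypothetical three-row frame (not the real user data) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one with only a state, one with no location.
demo = pd.DataFrame({
    'city': ['Tucson', np.nan, np.nan],
    'state': ['Arizona', 'Ohio', np.nan],
    'postal_code': ['85743', np.nan, np.nan],
})
location_cols = ['city', 'state', 'postal_code']

# how='all' keeps the Ohio row because it still has a state.
kept_all = demo.dropna(subset=location_cols, how='all')

# how='any' keeps only the fully populated Tucson row.
kept_any = demo.dropna(subset=location_cols, how='any')

print(len(kept_all), len(kept_any))
```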

In [38]:
# create a dataframe with the number of female users by State. Dropping NaN state values.
state_users = users.dropna(subset=['state'], how='any') \
                [(users.dropna(subset=['state'], how='any').gender == 'Female')] \
                .groupby('state') \
                .state.count() \
                .reset_index(name='female_count') \
                .sort_values(['state'], ascending=True)

# create a dataframe with the number of male users by State. Dropping NaN state values.
male_users = users.dropna(subset=['state'], how='any') \
                [(users.dropna(subset=['state'], how='any').gender == 'Male')] \
                .groupby('state') \
                .state.count() \
                .reset_index(name='male_count') \
                .sort_values(['state'], ascending=True)

# join the male counts onto the female counts
# (note: join aligns rows on the positional index, not on the state name)
state_users = state_users.join(male_users.male_count)
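One caveat with joining this way: `DataFrame.join` lines rows up by index, and after `reset_index` the index is just a row position. The counts only match up if both frames contain exactly the same states in the same order. A possible alternative, sketched here on hypothetical data with a hypothetical helper `count_by_state`, is to merge on the state column itself:

```python
import pandas as pd

# Hypothetical users; Texas has no female users, so the two count
# frames end up with different lengths.
demo_users = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Texas', 'Utah', 'Utah', 'Utah', 'Utah'],
    'gender': ['Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male'],
})

def count_by_state(frame, gender, column_name):
    """Count the users of one gender per state (illustrative helper)."""
    subset = frame[frame.gender == gender]
    return subset.groupby('state').size().reset_index(name=column_name)

female_counts = count_by_state(demo_users, 'Female', 'female_count')
male_counts = count_by_state(demo_users, 'Male', 'male_count')

# Index-based join: Utah (row 1 of female_counts) picks up row 1 of
# male_counts, which is Texas.
misaligned = female_counts.join(male_counts.male_count)

# Key-based merge: rows are matched on the state name; states missing
# from one side get a count of 0.
merged = female_counts.merge(male_counts, on='state', how='outer')
merged = merged.fillna(0).astype({'female_count': int, 'male_count': int})
merged = merged.sort_values('state').reset_index(drop=True)
print(merged)
```

When every state has both female and male users, the two approaches give the same table; they differ only when a state is missing from one side.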

In [39]:
# return the state with the most female users
state_users.sort_values('female_count', ascending=False).head(1)

Unnamed: 0,state,female_count,male_count
4,California,57,39


In [40]:
# return the state with the most male users
state_users.sort_values('male_count', ascending=False).head(1)

Unnamed: 0,state,female_count,male_count
4,California,57,39


In [41]:
# create a dataframe with the number of female users by city
city_users = users.dropna(subset=['city'], how='any') \
                [(users.dropna(subset=['city'], how='any').gender == 'Female')] \
                .groupby(['city', 'state']) \
                .city.count() \
                .reset_index(name='female_count') \
                .sort_values(['city','state'], ascending=[True,True])

# create a dataframe with the number of male users by city
male_users = users.dropna(subset=['city'], how='any') \
                [(users.dropna(subset=['city'], how='any').gender == 'Male')] \
                .groupby(['city','state']) \
                .city.count() \
                .reset_index(name='male_count') \
                .sort_values(['city', 'state'], ascending=[True,True])
                
# join the male counts onto the female counts
# (note: both steps below align rows on the positional index, not on city and state)
city_users = city_users.join(male_users.male_count)
city_users.male_count = male_users.male_count.astype(int)
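Another way to get both counts side by side, without building and joining two separate frames, is `pd.crosstab`. This is only a sketch on hypothetical rows, not the approach used in the cell above:

```python
import pandas as pd

# Hypothetical users for illustration only.
demo_users = pd.DataFrame({
    'city': ['Tampa', 'Tampa', 'Tampa', 'Canton', 'Canton', 'Tucson'],
    'state': ['Florida', 'Florida', 'Florida', 'Ohio', 'Ohio', 'Arizona'],
    'gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
})

# One row per (city, state) pair and one column per gender; pairs with
# no users of a gender get 0 rather than NaN.
city_counts = pd.crosstab([demo_users.city, demo_users.state], demo_users.gender)

# idxmax returns the (city, state) index label with the largest count.
top_female_city = city_counts['Female'].idxmax()
top_male_city = city_counts['Male'].idxmax()
print(top_female_city, top_male_city)
```

Because each count sits in the same row as its own city and state, there is no separate alignment step to get wrong.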

In [42]:
# return the city with the most female users
city_users.sort_values('female_count', ascending=False).head(1)

Unnamed: 0,city,state,female_count,male_count
203,Washington,District of Columbia,20,


In [43]:
# return the city with the most male users
city_users.sort_values('male_count', ascending=False).head(1)

Unnamed: 0,city,state,female_count,male_count
188,Tampa,Florida,2,18.0


## Matthew Manno's Answers
To answer my questions:
    1. What city has the largest number of female users?
                Washington, DC
    2. What city has the largest number of male users?
                Tampa, FL
    3. What state has the largest number of female users?
                California
    4. What state has the largest number of male users?
                California

## Djamshed Djuraev's Data
Djamshed's post:
    I have found the following CSV formatted data which is called "Death_Rates_NYC". Here is the CSV file: Death_Rates_NYC.csv 

    Here is what I will be analyzing by using pandas:

    1. Some rows contain dots, which represent missing data. I will change the dots to null and will sort the data by Year and Leading Cause.
    2. I will see all causes of death among females of white non Hispanic Race Ethnicity.
    3. I will filter data by grouping by Race Ethnicity and sort the data by Race Ethnicity and Leading Cause.

In [44]:
# read the csv into a dataframe, and set all "." to na values
dot = pd.read_csv('data/Death_Rates_NYC.csv', na_values=['.'])

In [45]:
dot.head(5)

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
0,2014,Diabetes Mellitus (E10-E14),F,Other Race/ Ethnicity,11.0,,
1,2011,Cerebrovascular Disease (Stroke: I60-I69),M,White Non-Hispanic,290.0,21.7,18.2
2,2008,Malignant Neoplasms (Cancer: C00-C97),M,Not Stated/Unknown,60.0,,
3,2010,Malignant Neoplasms (Cancer: C00-C97),F,Hispanic,1045.0,85.9,98.5
4,2012,Cerebrovascular Disease (Stroke: I60-I69),M,Black Non-Hispanic,170.0,19.9,23.3
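The `na_values` argument is what turns the dots into NaN at read time. Without it, a single "." would force the whole column to be read as text. A minimal sketch on a made-up two-row CSV (the numbers are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV in which a lone dot marks a missing death rate.
csv_text = "Year,Deaths,Death Rate\n2014,11,.\n2011,290,21.7\n"

# Without na_values the dot is kept as text, so the column is not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=['.'] the dot becomes NaN and the column is read as floats.
clean = pd.read_csv(io.StringIO(csv_text), na_values=['.'])

print(raw['Death Rate'].tolist())
print(clean['Death Rate'].tolist())
```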


In [46]:
# rename the columns for ease of use
dot = dot.rename(columns={'Year': 'year', \
                          'Leading Cause': 'leading_cause', \
                          'Sex': 'sex', \
                          'Race Ethnicity': 'race_ethnicity', \
                          'Deaths': 'deaths', \
                          'Death Rate': 'death_rate', \
                          'Age Adjusted Death Rate': 'age_adjusted_death_rate'})

# this line drops any rows with null for Year, Leading Cause, or Deaths.
dot.dropna(subset=['year', 'leading_cause', 'deaths'], how='any', inplace=True) 

# set the deaths column to integers so pandas interprets it numerically
dot.deaths = dot.deaths.astype(int)

In [47]:
# Sort the dataframe by year (ascending) and then by deaths (descending),
# so the causes with the most deaths come first within each year.
dot.sort_values(['year', 'deaths'], ascending=[True, False], inplace=True)

In [48]:
# Filter the dataframe to show only deaths for White Non-Hispanic females.
f_wnh_deaths = dot[(dot.race_ethnicity == 'White Non-Hispanic') & (dot.sex == 'F')]
            
f_wnh_deaths.head(50)

Unnamed: 0,year,leading_cause,sex,race_ethnicity,deaths,death_rate,age_adjusted_death_rate
234,2007,All Other Causes,F,White Non-Hispanic,1680,117.1,81.5
181,2007,"Accidents Except Drug Posioning (V01-X39, X43,...",F,White Non-Hispanic,162,11.3,7.3
230,2008,All Other Causes,F,White Non-Hispanic,1706,118.9,78.6
38,2008,Diabetes Mellitus (E10-E14),F,White Non-Hispanic,210,14.6,8.8
190,2009,Malignant Neoplasms (Cancer: C00-C97),F,White Non-Hispanic,3346,232.9,159.0
119,2010,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,White Non-Hispanic,5351,374.2,189.2
224,2010,Chronic Lower Respiratory Diseases (J40-J47),F,White Non-Hispanic,501,35.0,20.7
133,2010,Essential Hypertension and Renal Diseases (I10...,F,White Non-Hispanic,219,15.3,7.7
195,2011,Malignant Neoplasms (Cancer: C00-C97),F,White Non-Hispanic,3371,238.0,161.1
165,2011,Essential Hypertension and Renal Diseases (I10...,F,White Non-Hispanic,199,14.0,6.8
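The sort and the filter can be seen working together on a small example. Each condition in the mask needs its own parentheses because `&` binds more tightly than `==` in Python. The rows below are hypothetical:

```python
import pandas as pd

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    'year': [2008, 2007, 2007, 2007, 2007],
    'sex': ['F', 'F', 'M', 'F', 'F'],
    'race_ethnicity': ['White Non-Hispanic', 'White Non-Hispanic',
                       'White Non-Hispanic', 'Hispanic', 'White Non-Hispanic'],
    'deaths': [210, 162, 300, 95, 1680],
})

# Combine two boolean conditions element-wise; both must hold for a row to be kept.
mask = (demo.race_ethnicity == 'White Non-Hispanic') & (demo.sex == 'F')

# Keep matching rows, oldest year first, largest death count first within a year.
selected = demo[mask].sort_values(['year', 'deaths'], ascending=[True, False])
print(selected)
```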


In [49]:
# "I will filter data by grouping by Race Ethnicity and sort the data by Race Ethnicity and Leading Cause."
# I think what he is trying to say is that he will sort the data in such a way that he will be able
# to discern the leading cause of death by race/ethnicity. To do this we first have to create a
# dataframe that holds the race/ethnicity, leading cause of death, and sum of those deaths across all years.
# This line of code does that.
leading_cause_by_race = dot.groupby(['race_ethnicity', 'leading_cause'], \
                        as_index=False)['deaths'].sum()

In [50]:
# Then we take the dataframe with the race/ethnicity, leading cause of death, and sum of those 
# deaths from each year, and sort it by race/ethnicity, and the sum of the Leading cause of death.
leading_cause_by_race.sort_values(['race_ethnicity', 'deaths'], ascending=[True,False])

Unnamed: 0,race_ethnicity,leading_cause,deaths
11,Asian and Pacific Islander,Malignant Neoplasms (Cancer: C00-C97),3263
1,Asian and Pacific Islander,All Other Causes,1743
7,Asian and Pacific Islander,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",1730
9,Asian and Pacific Islander,Influenza (Flu) and Pneumonia (J09-J18),429
2,Asian and Pacific Islander,Cerebrovascular Disease (Stroke: I60-I69),391
5,Asian and Pacific Islander,Chronic Lower Respiratory Diseases (J40-J47),286
0,Asian and Pacific Islander,"Accidents Except Drug Posioning (V01-X39, X43,...",168
6,Asian and Pacific Islander,Diabetes Mellitus (E10-E14),128
8,Asian and Pacific Islander,Essential Hypertension and Renal Diseases (I10...,92
10,Asian and Pacific Islander,"Intentional Self-Harm (Suicide: X60-X84, Y87.0)",58
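To read off only the single leading cause for each group instead of scanning the full sorted table, one option is to take the first row of each group after sorting. This is a sketch on hypothetical totals with placeholder group names:

```python
import pandas as pd

# Hypothetical records; the group names and counts are placeholders.
demo = pd.DataFrame({
    'race_ethnicity': ['Group A', 'Group A', 'Group A', 'Group B', 'Group B'],
    'leading_cause': ['Cancer', 'Heart', 'Cancer', 'Heart', 'Flu'],
    'deaths': [10, 30, 25, 40, 5],
})

# Sum deaths across years for each (group, cause) pair.
totals = demo.groupby(['race_ethnicity', 'leading_cause'], as_index=False)['deaths'].sum()

# Sort so the largest total comes first within each group.
ranked = totals.sort_values(['race_ethnicity', 'deaths'], ascending=[True, False])

# head(1) per group keeps only the top-ranked cause for that group.
top_cause = ranked.groupby('race_ethnicity').head(1).reset_index(drop=True)
print(top_cause)
```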


## Nicholas Ileczko's Data
Nicholas's post:
    This dataset contains traffic violation information from all electronic traffic violations issued in the County.
    https://catalog.data.gov/dataset/traffic-violations-56dda

    1. I will analyze the data by finding out which type of car has the most violations
    2. Which state besides Maryland has the highest violations
    3. Which violation is most popular

The dataset provided was over 400 MB, so I truncated it significantly, to only the first 1000 rows, for the purposes of this assignment.
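One way to take only the first rows of a large CSV without loading the whole file is the `nrows` argument of `pd.read_csv`. This is a sketch of that option on generated stand-in data, not necessarily how the truncated file here was produced:

```python
import io
import pandas as pd

# Stand-in for a large file: 5000 generated rows (hypothetical data).
big_csv = "row_id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(5000)) + "\n"

# nrows stops parsing after the first 1000 data rows.
first_rows = pd.read_csv(io.StringIO(big_csv), nrows=1000)
print(len(first_rows))
```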

In [65]:
# read the truncated traffic violations csv into a dataframe, forcing Year to be read as a string
traffic = pd.read_csv('data/Traffic_Violations_Short.csv', dtype={'Year': str})

In [75]:
# add a column that combines the year, make, and model into one string
traffic['ymm'] = traffic['Year'].astype(str) + ' ' + traffic['Make'] + ' ' + traffic['Model']
traffic.head(5)

In [71]:
# count the violations for each year/make/model string, most frequent first,
# to find which type of car has the most violations
traffic.groupby('ymm').ymm.count().sort_values(ascending=False)
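The first two questions from Nicholas's post could be approached along these lines. This is only a sketch on hypothetical rows; it assumes the file has `Year`, `Make`, and `Model` columns as used above, and the `State` column name is my assumption about the dataset:

```python
import pandas as pd

# Hypothetical violations; 'State' is an assumed column name.
demo_traffic = pd.DataFrame({
    'Year': ['2010', '2010', '2015', '2008', '2012'],
    'Make': ['TOYOTA', 'TOYOTA', 'HONDA', 'FORD', 'HONDA'],
    'Model': ['CAMRY', 'CAMRY', 'CIVIC', 'F150', 'ACCORD'],
    'State': ['MD', 'VA', 'MD', 'VA', 'DC'],
})

# Build one "year make model" string per violation and count repeats.
ymm = demo_traffic['Year'] + ' ' + demo_traffic['Make'] + ' ' + demo_traffic['Model']
ymm_counts = ymm.value_counts()
most_common_vehicle = ymm_counts.idxmax()

# Exclude Maryland, then find the state with the most remaining violations.
outside_md = demo_traffic[demo_traffic['State'] != 'MD']
top_other_state = outside_md['State'].value_counts().idxmax()

print(most_common_vehicle, top_other_state)
```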