## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory; omitting it incurs a 2-point deduction from the assignment._ Only add plain text in the designated areas, i.e., replace the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You must also describe any tutoring support received in completing this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Edward Day
    - Email: ED558@drexel.edu
- Group member 2
    - Name: Sophia Lee
    - Email: sl3385@drexel.edu
- Group member 3
    - Name: Sahar Siddiqi
    - Email: ss5226@drexel.edu
- Group member 4
    - Name: NA
    - Email: NA

### Additional submission comments
- Tutoring support received: Jacob Rosen jkr58@drexel.edu
- Other (other): Sophia Lee

# Assignment group 1: Textual feature extraction and numerical comparison

## Module A _(35 points)_ Processing numeric data

__A1.__ _(3 points)_ In this problem, you will be working with demographics data from The Henry J. Kaiser Family Foundation (https://www.kff.org/), which covers the population of 52 locations (50 states, District of Columbia, and Puerto Rico) broken down by race, gender, age, and the number of adults with and without children. This data is obtained from the Census Bureau’s American Community Survey (ACS). The data is stored as a `csv` file in the attached `data` directory. Read the `data/demographics.csv` file into a `pandas` dataframe and print the dimensionality (using the `.shape` attribute) and the names of the features of the dataframe.

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("./data/demographics.csv" , sep = ",")

data.shape

(52, 18)
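
The prompt also asks for the feature names, which `data.columns` exposes. A minimal sketch, assuming a small hypothetical in-memory CSV (`csv_text`, `demo`) in place of `data/demographics.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for data/demographics.csv (values are made up).
csv_text = "Location,Male,Female\nA,100,110\nB,200,190\n"
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)             # (rows, columns)
print(demo.columns.tolist())  # feature names as a plain list
```

On the real dataframe the same two calls would be `data.shape` and `data.columns.tolist()`.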

__A2.__ _(2 points)_ To gain a better insight into the dataframe, show the head and tail of the dataframe (using the `.head()` and `.tail()` methods).

In [4]:
data.head()
data.tail()

Unnamed: 0,Location,Male,Female,Adults_with_Children,Adults_with_No_Children,White,Black,Hispanic,Asian,American_Indian_or_Alaska_Native,Native_Hawaiian_or_Other_Pacific_Islander,Two_Or_More_Races,Age0_18,Age_19_25,Age_26_34,Age_35_54,Age_55_64,Age_65_plus
47,Washington,3589700,3650100,1522400,2920700,4976100,246900.0,916000,617900.0,68600.0,42200.0,372200.0,1701700,628700,965900.0,1899400,949100,1095000
48,West Virginia,863400,898700,322100,714100,1631600,64300.0,20400,14200.0,2000.0,,28800.0,385000,148600,178300.0,448700,260600,340900
49,Wisconsin,2792400,2850600,1161900,2230900,4600500,328100.0,382000,166800.0,43500.0,,120600.0,1319600,495600,643300.0,1441400,812500,930500
50,Wyoming,286600,276700,118200,214300,474400,5300.0,56100,4200.0,12100.0,,11200.0,142400,48500,69400.0,133100,81500,88300
51,Puerto Rico,1570300,1734500,543200,1406500,23000,,3274500,,,,,701800,327600,344300.0,849600,428200,653200
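
Only the tail appears above because a notebook cell renders just the value of its last expression. A minimal sketch, using a hypothetical five-row frame (`toy`), that prints both ends explicitly:

```python
import pandas as pd

# Hypothetical frame; values are placeholders, not the demographics data.
toy = pd.DataFrame({"Location": list("ABCDE"), "Male": [1, 2, 3, 4, 5]})

# print() forces output for every call, not only for the last
# expression in the cell.
first_rows = toy.head(2)
last_rows = toy.tail(2)
print(first_rows)
print(last_rows)
```

In a notebook, IPython's `display()` could be used in place of `print()` to keep the rendered table formatting.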


__A3.__ _(5 points)_ As you can see, there is no `total` population column in this dataframe for each location. Therefore, create a new column which shows the summation of `Male` and `Female` populations. 

The `pandas` package provides a `.describe()` method which gives a descriptive summary of the desired column(s) (such as mean, standard deviation, min, and max values). Print these statistics for the column `total` which you just created. Then, compare the average population of locations obtained from `.describe()` method with the output of the `mean` method of `numpy` package. Are they the same?

In [6]:
data['Total'] = data['Male'] + data['Female']

summary = data['Total'].describe()
print(summary)
print('NumPy Mean:', np.mean(data['Total']))

# Look up the mean by label, and compare floats with a tolerance rather than ==.
if np.isclose(summary['mean'], np.mean(data['Total'])):
    print('Averages match')
else:
    print('Averages do not match')

count    5.200000e+01
mean     6.160138e+06
std      7.092525e+06
min      5.633000e+05
25%      1.742100e+06
50%      4.185050e+06
75%      6.940925e+06
max      3.871380e+07
Name: Total, dtype: float64
NumPy Mean: 6160138.461538462
Averages match


__A4:__ _(5 points)_ Find the locations with the minimum and maximum populations.

In [8]:
print('Minimum population:', data[data['Total']==data['Total'].min()]['Location'])

print('Maximum population: ', data[data['Total']==data['Total'].max()]['Location'])

Minimum population: 50    Wyoming
Name: Location, dtype: object
Maximum population:  4    California
Name: Location, dtype: object
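
The printed Series carry their index and dtype along with the name. If only the location name is wanted, `idxmin` and `idxmax` return the row label of the extreme value directly. A sketch on a hypothetical three-row frame (names and totals are made up):

```python
import pandas as pd

# Hypothetical data; not taken from the demographics file.
toy = pd.DataFrame({"Location": ["A", "B", "C"], "Total": [300, 100, 900]})

# idxmin/idxmax give the index label of the smallest/largest value.
smallest = toy.loc[toy["Total"].idxmin(), "Location"]
largest = toy.loc[toy["Total"].idxmax(), "Location"]
print("Minimum population:", smallest)
print("Maximum population:", largest)
```

Note that `idxmin`/`idxmax` report only the first occurrence, whereas the boolean filter used above would return every tied location.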


__A5.__ _(5 points)_ In this part, we are looking at two columns, `Adults_with_Children` and `Adults_with_No_Children`. 
It seems that the populations in these two columns do not include children (aged younger than 18 years) and older adults (aged older than 64 years). Confirm this hypothesis for `Pennsylvania`, `Colorado`, and `Georgia` by checking that the sum of these two columns equals the sum of the populations of all age groups once the `Age0_18` and `Age_65_plus` columns are excluded. (To do so, you may create a new column `total_adults_age_groups` and compare it with the sum of `Adults_with_Children` and `Adults_with_No_Children`.)

In [10]:
data['total_adults_age_groups'] = data['Age_19_25']+data['Age_26_34']+data['Age_35_54']+data['Age_55_64']
match = (data['total_adults_age_groups'] == (data['Adults_with_Children']+data['Adults_with_No_Children'])) 

match[data["Location"].isin(['Pennsylvania', 'Colorado','Georgia'])]

5     True
10    True
38    True
dtype: bool

__A6:__ _(2 points)_ It seems that our hypothesis is correct for these three states. To make sure, we need to confirm that the differences between the `total_adults_age_groups` column and the sum of `Adults_with_Children` and `Adults_with_No_Children` are zero for all locations. You could do that by inspecting every location's difference. Instead, create a logical rule (boolean mask) that returns `False` if there is at least one non-zero value in the difference between the `total_adults_age_groups` column and the sum of `Adults_with_Children` and `Adults_with_No_Children`.

In [12]:

match = np.all(data['total_adults_age_groups'] == (data['Adults_with_Children']+data['Adults_with_No_Children']))
print(match)

False
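
An equivalent formulation works on the difference itself: the rule is `True` only when every difference is zero. A sketch with hypothetical column values (the frame `toy` is illustrative only):

```python
import pandas as pd

# Hypothetical values; the third row is off by 100 on purpose.
toy = pd.DataFrame({
    "total_adults_age_groups": [500, 700, 900],
    "Adults_with_Children": [200, 300, 400],
    "Adults_with_No_Children": [300, 400, 600],
})

diff = toy["total_adults_age_groups"] - (
    toy["Adults_with_Children"] + toy["Adults_with_No_Children"]
)

# A single non-zero difference is enough to make the rule False.
all_zero = bool((diff == 0).all())
print(all_zero)
```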


__A7:__ _(3 points)_ It seems that our hypothesis is not correct for all the locations. We need to know which locations differ and by exactly how much, to see whether the discrepancy is due to rounding or reflects a significant difference. Create a dictionary that stores the locations (as `keys`) and the difference between the adult population obtained by summing the `Adults_with_Children` and `Adults_with_No_Children` columns and the one obtained by summing the `Age_19_25`, `Age_26_34`, `Age_35_54`, and `Age_55_64` age groups (as `values`).

In [14]:
offset_values = {row['Location']: row['total_adults_age_groups'] - (row['Adults_with_Children']+row['Adults_with_No_Children']) for index, row in data.iterrows() }
print(offset_values)

{'Alabama': -100.0, 'Alaska': 0.0, 'Arizona': -100.0, 'Arkansas': -100.0, 'California': -100.0, 'Colorado': 0.0, 'Connecticut': 0.0, 'Delaware': -100.0, 'District of Columbia': 0.0, 'Florida': -100.0, 'Georgia': 0.0, 'Hawaii': 0.0, 'Idaho': 100.0, 'Illinois': 100.0, 'Indiana': 0.0, 'Iowa': 100.0, 'Kansas': -100.0, 'Kentucky': -100.0, 'Louisiana': -100.0, 'Maine': 0.0, 'Maryland': 0.0, 'Massachusetts': 0.0, 'Michigan': 0.0, 'Minnesota': -100.0, 'Mississippi': 0.0, 'Missouri': 0.0, 'Montana': 100.0, 'Nebraska': -100.0, 'Nevada': 0.0, 'New Hampshire': 0.0, 'New Jersey': 0.0, 'New Mexico': 100.0, 'New York': 0.0, 'North Carolina': -200.0, 'North Dakota': -100.0, 'Ohio': 0.0, 'Oklahoma': 100.0, 'Oregon': 100.0, 'Pennsylvania': 0.0, 'Rhode Island': 0.0, 'South Carolina': -100.0, 'South Dakota': 0.0, 'Tennessee': 0.0, 'Texas': 0.0, 'Utah': -100.0, 'Vermont': 100.0, 'Virginia': -200.0, 'Washington': 0.0, 'West Virginia': 0.0, 'Wisconsin': 0.0, 'Wyoming': 0.0, 'Puerto Rico': 0.0}
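
Every non-zero difference above is a multiple of 100, which is consistent with rounding in the source columns rather than a substantive gap. To isolate those locations, the same dictionary can be built without `iterrows` and then filtered. A sketch on hypothetical values (`toy`, `offsets`, and `nonzero` are illustrative names):

```python
import pandas as pd

# Hypothetical values; only location "B" has a mismatch.
toy = pd.DataFrame({
    "Location": ["A", "B", "C"],
    "total_adults_age_groups": [500.0, 700.0, 900.0],
    "Adults_with_Children": [200, 300, 400],
    "Adults_with_No_Children": [300, 500, 500],
})

# Vectorized difference, then pair each location with its value.
diff = toy["total_adults_age_groups"] - (
    toy["Adults_with_Children"] + toy["Adults_with_No_Children"]
)
offsets = dict(zip(toy["Location"], diff))

# Keep only the locations whose difference is not zero.
nonzero = {loc: d for loc, d in offsets.items() if d != 0}
print(nonzero)
```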


__A8.__ _(7 points)_ In this part, we are going to find the similarity of locations based on their racial population distributions using _cosine similarity_. Where the dataframe has no population for a race at a location (the corresponding value is `NaN`), replace it with zero using the `.fillna()` method of `pandas`. Then, create a list and append to it each pair of locations with their similarity as a tuple, like: `(loc1, loc2, similarity value)`.


In [17]:
Races=['White', 'Black', 'Hispanic', 'Asian',
       'American_Indian_or_Alaska_Native',
       'Native_Hawaiian_or_Other_Pacific_Islander', 'Two_Or_More_Races']
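# Assign the filled columns back; calling fillna(inplace=True) on a column
# selection is chained assignment and may not update `data` in recent pandas.
data[Races] = data[Races].fillna(0)
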
cosines = []
for loc1 in range(0,data.shape[0]-1):
    for loc2 in range (loc1+1,data.shape[0]):
        cosines.append((loc1,loc2,(np.array(data.iloc[loc1,5:12]).dot(np.array(data.iloc[loc2,5:12])))))
print (cosines)

[(0, 1, 1392566600000.0), (0, 2, 12525993400000.0), (0, 3, 7178516600000.0), (0, 4, 50657533330000.0), (0, 5, 12136864290000.0), (0, 6, 7791874330000.0), (0, 7, 2089308610000.0), (0, 8, 1135319720000.0), (0, 9, 39506510970000.0), (0, 10, 20865959650000.0), (0, 11, 973651420000.0), (0, 12, 4366553700000.0), (0, 13, 26546376000000.0), (0, 14, 16838012090000.0), (0, 15, 8325028990000.0), (0, 16, 6956361330000.0), (0, 17, 11871016000000.0), (0, 18, 10206484740000.0), (0, 19, 3824553560000.0), (0, 20, 11647494190000.0), (0, 21, 15581369770000.0), (0, 22, 24630963850000.0), (0, 23, 14116972440000.0), (0, 24, 6512530520000.0), (0, 25, 15608480730000.0), (0, 26, 2775737000000.0), (0, 27, 4741281130000.0), (0, 28, 5011109140000.0), (0, 29, 3700728150000.0), (0, 30, 16945317050000.0), (0, 31, 2615459440000.0), (0, 32, 37547710610000.0), (0, 33, 22439755190000.0), (0, 34, 1950658090000.0), (0, 35, 29793139610000.0), (0, 36, 8262758400000.0), (0, 37, 9823701110000.0), (0, 38, 31483848340000.0), (0
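
The values printed above are raw dot products, which grow with population size. Cosine similarity additionally divides the dot product by the product of the two vector norms, giving a value in [0, 1] for non-negative counts. A minimal sketch, assuming hypothetical race-count vectors keyed by location name (`counts` and `cosine_similarity` are illustrative, not part of the submission):

```python
from itertools import combinations

import numpy as np

# Hypothetical count vectors; not taken from the demographics file.
counts = {
    "A": np.array([100.0, 50.0, 0.0]),
    "B": np.array([200.0, 100.0, 0.0]),  # same proportions as A
    "C": np.array([0.0, 0.0, 30.0]),     # no overlap with A or B
}


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# One (loc1, loc2, similarity) tuple per unordered pair of locations.
pairs = [
    (a, b, cosine_similarity(counts[a], counts[b]))
    for a, b in combinations(counts, 2)
]
print(pairs)
```

Because of the normalization, "A" and "B" score as nearly identical despite their different sizes, which is the property that makes the measure compare distributions rather than magnitudes.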

__A9.__ _(3 points)_ What are the two most similar and the two least similar locations based on their racial populations?

In [19]:
from operator import itemgetter 

maxLoc=max(cosines, key = itemgetter(2))
minLoc=min(cosines, key = itemgetter(2))
print("Most similar are: ",data["Location"][maxLoc[0]],"and",data["Location"][maxLoc[1]])
print("Least similar are: ",data["Location"][minLoc[0]],"and",data["Location"][minLoc[1]])

Most similar are:  California and Texas
Least similar are:  Vermont and Puerto Rico
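
`max` and `min` return a single pair each. If the question is read as asking for the top two and bottom two pairs, `heapq.nlargest` and `heapq.nsmallest` generalize the same lookup. A sketch over a hypothetical list of `(loc1, loc2, similarity)` tuples:

```python
import heapq
from operator import itemgetter

# Hypothetical similarity tuples; the values are made up.
pairs = [
    ("A", "B", 0.99),
    ("A", "C", 0.10),
    ("B", "C", 0.45),
    ("A", "D", 0.80),
]

# Rank by the third element of each tuple (the similarity value).
most = heapq.nlargest(2, pairs, key=itemgetter(2))
least = heapq.nsmallest(2, pairs, key=itemgetter(2))
print("Most similar pairs:", most)
print("Least similar pairs:", least)
```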
