# Assignment group 1: Textual feature extraction and numerical comparison

## Module A _(35 points)_ Processing numeric data

__A1.__ In this problem, you will work with demographics data from The Henry J. Kaiser Family Foundation (https://www.kff.org/), which includes the population of 52 locations (50 states, the District of Columbia, and Puerto Rico) broken down by race, gender, age, and the number of adults with and without children. The data was obtained from the Census Bureau’s American Community Survey (ACS) and is stored as a `csv` file in the attached `data` directory. Read the `data/demographics.csv` file into a `pandas` dataframe and print its dimensionality (using the `.shape` attribute) and the names of its features (columns). (3 points)

In [16]:
#Libraries in use:
import pandas as pd
import requests
import urllib
from pprint import pprint
import re
import csv
import numpy as np

In [2]:
# code here
df_demographics = pd.read_csv("data/demographics.csv")
df_demographics.shape

(52, 18)

In [3]:
df_demographics.columns

Index(['Location', 'Male', 'Female', 'Adults_with_Children',
       'Adults_with_No_Children', 'White', 'Black', 'Hispanic', 'Asian',
       'American_Indian_or_Alaska_Native',
       'Native_Hawaiian_or_Other_Pacific_Islander', 'Two_Or_More_Races',
       'Age0_18', 'Age_19_25', 'Age_26_34', 'Age_35_54', 'Age_55_64',
       'Age_65_plus'],
      dtype='object')

__A2.__ _(2 points)_ To gain a better insight into the dataframe, show the head and tail of the dataframe (using the `.head()` and `.tail()` methods).

In [4]:
df_demographics.head()

,Location,Male,Female,Adults_with_Children,Adults_with_No_Children,White,Black,Hispanic,Asian,American_Indian_or_Alaska_Native,Native_Hawaiian_or_Other_Pacific_Islander,Two_Or_More_Races,Age0_18,Age_19_25,Age_26_34,Age_35_54,Age_55_64,Age_65_plus
0,Alabama,2284900,2456500,878300,1941300,3119100,1259900.0,195700,63800.0,19800.0,,81800.0,1138300,430500,536200.0,1207200,645600,783600
1,Alaska,364500,345600,153600,283800,433000,18000.0,47400,46600.0,95500.0,6300.0,63300.0,192300,63500,97900.0,183600,92400,80500
2,Arizona,3363200,3478100,1322700,2646700,3761600,270200.0,2143400,220300.0,261300.0,12500.0,172000.0,1686200,637200,815200.0,1677000,839900,1185700
3,Arkansas,1422700,1487200,602000,1096800,2109200,434100.0,220300,48100.0,13500.0,7300.0,77500.0,730600,265200,329800.0,726600,377100,480600
4,California,19113000,19600800,7955200,15981800,14305700,2061600.0,15194400,5598000.0,138100.0,122900.0,1293200.0,9363800,3697900,5240600.0,10277900,4720500,5413200


In [5]:
df_demographics.tail()

,Location,Male,Female,Adults_with_Children,Adults_with_No_Children,White,Black,Hispanic,Asian,American_Indian_or_Alaska_Native,Native_Hawaiian_or_Other_Pacific_Islander,Two_Or_More_Races,Age0_18,Age_19_25,Age_26_34,Age_35_54,Age_55_64,Age_65_plus
47,Washington,3589700,3650100,1522400,2920700,4976100,246900.0,916000,617900.0,68600.0,42200.0,372200.0,1701700,628700,965900.0,1899400,949100,1095000
48,West Virginia,863400,898700,322100,714100,1631600,64300.0,20400,14200.0,2000.0,,28800.0,385000,148600,178300.0,448700,260600,340900
49,Wisconsin,2792400,2850600,1161900,2230900,4600500,328100.0,382000,166800.0,43500.0,,120600.0,1319600,495600,643300.0,1441400,812500,930500
50,Wyoming,286600,276700,118200,214300,474400,5300.0,56100,4200.0,12100.0,,11200.0,142400,48500,69400.0,133100,81500,88300
51,Puerto Rico,1570300,1734500,543200,1406500,23000,,3274500,,,,,701800,327600,344300.0,849600,428200,653200


__A3.__ _(5 points)_ As you can see, the dataframe has no column for the total population of each location. Create a new column, `Total`, that holds the sum of the `Male` and `Female` populations.

The `pandas` package provides a `.describe()` method, which gives a descriptive summary of the selected column(s) (such as the mean, standard deviation, minimum, and maximum). Print these statistics for the `Total` column you just created. Then compare the average population across locations reported by `.describe()` with the output of the `mean` function of the `numpy` package. Are they the same?

In [6]:
df_demographics['Total'] = df_demographics['Male'] + df_demographics['Female']
df_demographics['Total'].describe()

count    5.200000e+01
mean     6.160138e+06
std      7.092525e+06
min      5.633000e+05
25%      1.742100e+06
50%      4.185050e+06
75%      6.940925e+06
max      3.871380e+07
Name: Total, dtype: float64
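
The comparison with `numpy` asked for in A3 could be sketched as follows. This is a minimal illustration on three rows copied from the `head()` output above, not the full dataset; the names `toy`, `describe_mean`, `numpy_mean`, and `means_match` are made up for the example.

```python
import numpy as np
import pandas as pd

# Three rows copied from the head() output above; not the full dataset.
toy = pd.DataFrame({
    "Location": ["Alabama", "Alaska", "Arizona"],
    "Male": [2284900, 364500, 3363200],
    "Female": [2456500, 345600, 3478100],
})
toy["Total"] = toy["Male"] + toy["Female"]

# Mean as reported by describe() versus numpy's mean
describe_mean = toy["Total"].describe()["mean"]
numpy_mean = np.mean(toy["Total"])

# Both are arithmetic means, so they should agree up to floating-point rounding.
means_match = bool(np.isclose(describe_mean, numpy_mean))
print(describe_mean, numpy_mean, means_match)
```

Comparing with `np.isclose` rather than `==` avoids a spurious mismatch caused by floating-point rounding.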

__A4:__ _(5 points)_ Find the locations with the minimum and maximum populations.

In [7]:
df_demographics['Location'][df_demographics['Total'].idxmax()]

'California'

In [8]:
df_demographics['Location'][df_demographics['Total'].idxmin()]

'Wyoming'
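
The same lookup can also be written as a single `.loc` call, which selects the row label and the column in one step. A minimal sketch, assuming a toy dataframe whose three totals are the `Male` plus `Female` sums of rows shown above:

```python
import pandas as pd

# Totals computed from the Male and Female values shown in head()/tail().
toy = pd.DataFrame({
    "Location": ["Alaska", "California", "Wyoming"],
    "Total": [710100, 38713800, 563300],
})

# idxmax()/idxmin() return the row label of the extreme value;
# .loc then picks the Location at that label.
largest = toy.loc[toy["Total"].idxmax(), "Location"]
smallest = toy.loc[toy["Total"].idxmin(), "Location"]
print(largest, smallest)
```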

__A5.__ _(5 points)_ In this part, we look at two columns, `Adults_with_Children` and `Adults_with_No_Children`.
It seems that the populations in these two columns exclude children (the `Age0_18` group) and older adults (the `Age_65_plus` group). Confirm this hypothesis for `Pennsylvania`, `Colorado`, and `Georgia`: the sum of these two columns should equal the sum of all age-group populations once the `Age0_18` and `Age_65_plus` columns are excluded. (To do that, you may create a new column `total_adults_age_groups` and compare it with the sum of `Adults_with_Children` and `Adults_with_No_Children`.)

In [9]:
# Adults aged 19-64: sum of the four middle age-group columns
df_demographics['total_adults_age_groups'] = (
    df_demographics['Age_19_25'] + df_demographics['Age_26_34']
    + df_demographics['Age_35_54'] + df_demographics['Age_55_64'])
# Difference between the two adult columns and the age-group total
df_demographics['Hypo_diff'] = (
    df_demographics['Adults_with_No_Children'] + df_demographics['Adults_with_Children']
    - df_demographics['total_adults_age_groups'])
# Show the difference for the three states of interest
df_demographics.loc[
    df_demographics['Location'].isin(['Pennsylvania', 'Colorado', 'Georgia']), 'Hypo_diff']

5     0.0
10    0.0
38    0.0
Name: Hypo_diff, dtype: float64

<font color=blue>This verifies the hypothesis for these three states, although it does not hold for all locations.</font>

__A6:__ _(2 points)_ Our hypothesis appears to hold for these three states. To be sure, we need to confirm that the difference between the `total_adults_age_groups` column and the sum of `Adults_with_Children` and `Adults_with_No_Children` is zero for every location. Rather than inspecting each location's difference by eye, create a logical rule (boolean mask) that returns `False` if there is at least one non-zero value in the difference between `total_adults_age_groups` and the sum of `Adults_with_Children` and `Adults_with_No_Children`.

In [10]:
# Compare values only: True if the difference is exactly zero for every location
(df_demographics['Hypo_diff'] == 0).all()

False
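
A rule of this kind can be sketched by comparing the two sums element-wise and reducing the result with `.all()`. One caveat, as far as the pandas documentation describes it: `Series.equals` also requires matching dtypes, so an integer series and a float series holding the same numbers compare as unequal. That matters here because `Age_26_34` is displayed as a float column in the `head()` output, while the two adult columns are integers. The values below are hypothetical toy data, not rows from the file.

```python
import pandas as pd

# Hypothetical toy values: both series hold the same numbers,
# but one is integer-typed and the other float-typed.
adults_sum = pd.Series([100, 200, 300])            # int64
age_group_sum = pd.Series([100.0, 200.0, 300.0])   # float64

# Value-based rule: True only if every difference is zero.
all_zero = bool(((adults_sum - age_group_sum) == 0).all())

# Series.equals additionally requires the dtypes to match.
strict_equal = adults_sum.equals(age_group_sum)

print(all_zero, strict_equal)
```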

In [11]:
df_demographics.loc[df_demographics['Hypo_diff'] != 0, 'Location']

0            Alabama
2            Arizona
3           Arkansas
4         California
7           Delaware
9            Florida
12             Idaho
13          Illinois
15              Iowa
16            Kansas
17          Kentucky
18         Louisiana
23         Minnesota
26           Montana
27          Nebraska
31        New Mexico
33    North Carolina
34      North Dakota
36          Oklahoma
37            Oregon
40    South Carolina
44              Utah
45           Vermont
46          Virginia
Name: Location, dtype: object

__A7:__ _(3 points)_ It seems that our hypothesis does not hold for all locations. We need to know which locations differ, and by exactly how much, to judge whether the discrepancy is due to rounding or reflects a genuinely significant difference. Create a dictionary that stores the locations (as `keys`) and the difference between the adult population obtained by summing the `Adults_with_Children` and `Adults_with_No_Children` columns and the one obtained by summing the `Age_19_25`, `Age_26_34`, `Age_35_54`, and `Age_55_64` age groups (as `values`).

In [12]:
df_for_dict = df_demographics[['Location', 'Hypo_diff']]
df_for_dict.set_index('Location').T.to_dict('list')

{'Alabama': [100.0],
 'Alaska': [0.0],
 'Arizona': [100.0],
 'Arkansas': [100.0],
 'California': [100.0],
 'Colorado': [0.0],
 'Connecticut': [0.0],
 'Delaware': [100.0],
 'District of Columbia': [0.0],
 'Florida': [100.0],
 'Georgia': [0.0],
 'Hawaii': [0.0],
 'Idaho': [-100.0],
 'Illinois': [-100.0],
 'Indiana': [0.0],
 'Iowa': [-100.0],
 'Kansas': [100.0],
 'Kentucky': [100.0],
 'Louisiana': [100.0],
 'Maine': [0.0],
 'Maryland': [0.0],
 'Massachusetts': [0.0],
 'Michigan': [0.0],
 'Minnesota': [100.0],
 'Mississippi': [0.0],
 'Missouri': [0.0],
 'Montana': [-100.0],
 'Nebraska': [100.0],
 'Nevada': [0.0],
 'New Hampshire': [0.0],
 'New Jersey': [0.0],
 'New Mexico': [-100.0],
 'New York': [0.0],
 'North Carolina': [200.0],
 'North Dakota': [100.0],
 'Ohio': [0.0],
 'Oklahoma': [-100.0],
 'Oregon': [-100.0],
 'Pennsylvania': [0.0],
 'Rhode Island': [0.0],
 'South Carolina': [100.0],
 'South Dakota': [0.0],
 'Tennessee': [0.0],
 'Texas': [0.0],
 'Utah': [100.0],
 'Vermont': [-100.0],
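
If scalar values are preferred over one-element lists, and only the locations with a non-zero difference are of interest, one possible variant is sketched below. It uses four (location, difference) pairs taken from the output above; `toy`, `nonzero`, and `diff_by_location` are illustrative names.

```python
import pandas as pd

# A few (location, difference) pairs taken from the output above.
toy = pd.DataFrame({
    "Location": ["Alabama", "Alaska", "Idaho", "North Carolina"],
    "Hypo_diff": [100.0, 0.0, -100.0, 200.0],
})

# Keep only the rows where the two adult totals disagree.
nonzero = toy[toy["Hypo_diff"] != 0]

# Pair each location with its scalar difference.
diff_by_location = dict(zip(nonzero["Location"], nonzero["Hypo_diff"]))
print(diff_by_location)
```

Every population value shown in this dataframe is a multiple of 100, so differences of 100 or 200 are consistent with rounding in the published figures rather than a substantive discrepancy.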

__A8.__ _(7 points)_ In this part, we measure the similarity of locations based on the distribution of their population by race, using _cosine similarity_. Where the dataframe has no population value for a race in a location (the corresponding value is `NaN`), replace it with zero using the `.fillna()` method of `pandas`. Then create a list and append each pair of locations together with their similarity as a tuple of the form `(loc1, loc2, similarity value)`.


In [13]:
df_demographics.isnull().any()

Location                                     False
Male                                         False
Female                                       False
Adults_with_Children                         False
Adults_with_No_Children                      False
White                                        False
Black                                         True
Hispanic                                     False
Asian                                         True
American_Indian_or_Alaska_Native              True
Native_Hawaiian_or_Other_Pacific_Islander     True
Two_Or_More_Races                             True
Age0_18                                      False
Age_19_25                                    False
Age_26_34                                    False
Age_35_54                                    False
Age_55_64                                    False
Age_65_plus                                  False
Total                                        False
total_adults_age_groups        

<font color=blue>We can see that only the columns related to race have missing values, so we can use .fillna() easily.</font>

In [14]:
df_demographics = df_demographics.fillna(value = 0)
df_demographics.isnull().any()

Location                                     False
Male                                         False
Female                                       False
Adults_with_Children                         False
Adults_with_No_Children                      False
White                                        False
Black                                        False
Hispanic                                     False
Asian                                        False
American_Indian_or_Alaska_Native             False
Native_Hawaiian_or_Other_Pacific_Islander    False
Two_Or_More_Races                            False
Age0_18                                      False
Age_19_25                                    False
Age_26_34                                    False
Age_35_54                                    False
Age_55_64                                    False
Age_65_plus                                  False
Total                                        False
total_adults_age_groups        

__A9.__ _(3 points)_ Which two pairs of locations are the most similar, and which two are the least similar, based on their population by race?

In [41]:
dftest = df_demographics.loc[:, "White":"Native_Hawaiian_or_Other_Pacific_Islander"]

In [42]:
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(dftest)

In [43]:
sim

array([[1.        , 0.91729666, 0.85317378, ..., 0.95134596, 0.92995424,
        0.06456499],
       [0.91729666, 1.        , 0.9067029 , ..., 0.97576671, 0.97693181,
        0.11239732],
       [0.85317378, 0.9067029 , 1.        , ..., 0.90616774, 0.91861269,
        0.49864871],
       ...,
       [0.95134596, 0.97576671, 0.90616774, ..., 1.        , 0.99710776,
        0.08945949],
       [0.92995424, 0.97693181, 0.91861269, ..., 0.99710776, 1.        ,
        0.12435632],
       [0.06456499, 0.11239732, 0.49864871, ..., 0.08945949, 0.12435632,
        1.        ]])

In [61]:
from collections import defaultdict
# Map each distinct similarity value to the (row, column) index pairs that produced it
d = defaultdict(list)
for i in range(len(sim)):
    for j in range(len(sim[i])):
        d[sim[i][j]].append((i, j))

# Skip the five largest keys, which presumably are the self-similarities on the
# diagonal (1.0 up to floating-point rounding); this offset is specific to this data
for value in sorted(d.keys(), reverse=True)[5:7]:
    print(
        "Most similar states are "+str(
            df_demographics['Location'][d[value][0][0]])+" and "+str(
            df_demographics['Location'][d[value][0][1]])+" with similarity of "+str(value))
for value in sorted(d.keys(), reverse=False)[0:2]:
    print(
        "Least similar states are "+str(
            df_demographics['Location'][d[value][0][0]])+" and "+str(
            df_demographics['Location'][d[value][0][1]])+" with similarity of "+str(value))

Most similar states are Missouri and Ohio with similarity of 0.9999094979355956
Most similar states are Kansas and Nebraska with similarity of 0.9997223184821147
Least similar states are West Virginia and Puerto Rico with similarity of 0.019509135645946696
Least similar states are Maine and Puerto Rico with similarity of 0.023929028243359605
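
The list of `(loc1, loc2, similarity value)` tuples described in A8 could be built as sketched below. This is not the approach used in the cells above: it is a minimal alternative that enumerates each unordered pair once with `itertools.combinations`, so self-pairs never appear and no offset is needed to skip them. It uses only three race columns for three locations copied from the `head()`/`tail()` output, with Puerto Rico's missing `Black` value set to zero; the helper `cosine` and the other names are illustrative.

```python
from itertools import combinations

import numpy as np

# Race counts (White, Black, Hispanic) for three locations, copied from the
# head()/tail() output above; Puerto Rico's missing Black count is set to 0.
locations = ["Alabama", "Alaska", "Puerto Rico"]
counts = np.array([
    [3119100, 1259900, 195700],
    [433000, 18000, 47400],
    [23000, 0, 3274500],
], dtype=float)


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Each unordered pair appears once and self-pairs are never generated.
pairs = [
    (locations[i], locations[j], cosine(counts[i], counts[j]))
    for i, j in combinations(range(len(locations)), 2)
]

# Sort from most to least similar.
pairs.sort(key=lambda t: t[2], reverse=True)

most_similar, least_similar = pairs[0], pairs[-1]
print(most_similar)
print(least_similar)
```

On the full dataframe, the same pattern would apply with `counts` taken from the race columns after `.fillna(0)`; slicing `pairs[:2]` and `pairs[-2:]` would then give the two most and two least similar pairs.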
