## Homework 3: Eigenvector Decomposition

This homework uses the Communities and Crime dataset to compute the eigenvalues and eigenvectors of its covariance matrix and to analyze the principal components. Python 3.7.1 was used for this purpose.

The data file communities.data was taken from [1]. The file communities.names from [2], which contains the dataset documentation, was used to understand the dataset's contents and to identify the column labels.

The pandas [3] and scikit-learn [4] documentation was used as a guide to solve the problems presented in Homework 3. Courses previously taken at DataCamp [5] were also helpful in solving the homework.

## Libraries and modules used

- pandas, to handle the DataFrames and to count missing values
- NumPy, to compute the covariance matrix and its eigenvalues and eigenvectors
- scikit-learn (SimpleImputer), to replace missing values with the column mean
- urllib, to open a file from a URL
- re, for regular expression operations

In [1]:
##--Importing necessary libraries
import pandas as pd 
import numpy as np

## Header Creation
Inspection showed that the dataset contains 128 variables but no header row with column names. Typing the header by hand would be tedious and error-prone, so it was built with the following steps:
1.	Load the dataset documentation file [6]
2.	Transform every line into a string [7]
3.	Find the column names with a regular expression [8]
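The three steps can be tried offline before touching the real file. The following is a minimal sketch, not the notebook's own cell: the sample lines are hypothetical (modelled on the attribute names in this dataset), and the capture group is a variation that returns the bare names without the `-- ` prefix and the trailing colon.

```python
import re

# Hypothetical sample lines; the real documentation file is much longer.
sample_lines = [
    b"-- state: US state (by number)\n",
    b"-- county: numeric code for county\n",
    b"some unrelated line without an attribute name\n",
    b"-- population: population for community\n",
]

# The capture group keeps only the name between "-- " and ":".
pattern = re.compile(r"-- (\w+):")

clean_names = []
for raw in sample_lines:
    text = raw.decode("utf-8")                  # step 2: bytes -> str
    clean_names.extend(pattern.findall(text))   # step 3: collect every match

print(clean_names)
```

The notebook itself keeps the full match (for example `-- state:`) as the column label, so the later cells refer to the columns by those longer names.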

In [2]:
import urllib.request
import re

target_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names'

# Attribute names appear in the documentation as "-- name:"
regex = re.compile(r'-- \w+:')

header_list = []

for line in urllib.request.urlopen(target_url):
    line = str(line, 'utf-8')      # decode the bytes of each line into a string
    matches = regex.findall(line)
    for match in matches:
        header_list.append(match)

# Skip the first two matches and keep the rest as column names
header_list = header_list[2:]
header_list

['-- state:',
 '-- county:',
 '-- community:',
 '-- communityname:',
 '-- fold:',
 '-- population:',
 '-- householdsize:',
 '-- racepctblack:',
 '-- racePctWhite:',
 '-- racePctAsian:',
 '-- racePctHisp:',
 '-- agePct12t21:',
 '-- agePct12t29:',
 '-- agePct16t24:',
 '-- agePct65up:',
 '-- numbUrban:',
 '-- pctUrban:',
 '-- medIncome:',
 '-- pctWWage:',
 '-- pctWFarmSelf:',
 '-- pctWInvInc:',
 '-- pctWSocSec:',
 '-- pctWPubAsst:',
 '-- pctWRetire:',
 '-- medFamInc:',
 '-- perCapInc:',
 '-- whitePerCap:',
 '-- blackPerCap:',
 '-- indianPerCap:',
 '-- AsianPerCap:',
 '-- OtherPerCap:',
 '-- HispPerCap:',
 '-- NumUnderPov:',
 '-- PctPopUnderPov:',
 '-- PctLess9thGrade:',
 '-- PctNotHSGrad:',
 '-- PctBSorMore:',
 '-- PctUnemployed:',
 '-- PctEmploy:',
 '-- PctEmplManu:',
 '-- PctEmplProfServ:',
 '-- PctOccupManu:',
 '-- PctOccupMgmtProf:',
 '-- MalePctDivorce:',
 '-- MalePctNevMarr:',
 '-- FemalePctDiv:',
 '-- TotalPctDiv:',
 '-- PersPerFam:',
 '-- PctFam2Par:',
 '-- PctKids2Par:',
 '-- Pct

## Loading the File
Inspection showed that the dataset is stored as a CSV file. It was read directly from the dataset URL, and the column names created above were assigned to it. The output below shows the DataFrame summary and the first 5 rows.


In [3]:
# URL of the data file
file = 'http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data'

# Read the file, using the extracted names as column labels
df = pd.read_csv(file, names=header_list, delimiter=",")

print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Columns: 128 entries, -- state: to -- ViolentCrimesPerPop:
dtypes: float64(100), int64(2), object(26)
memory usage: 1.9+ MB
None


Unnamed: 0,-- state:,-- county:,-- community:,-- communityname:,-- fold:,-- population:,-- householdsize:,-- racepctblack:,-- racePctWhite:,-- racePctAsian:,...,-- LandArea:,-- PopDens:,-- PctUsePubTrans:,-- PolicCars:,-- PolicOperBudg:,-- LemasPctPolicOnPatr:,-- LemasGangUnitDeploy:,-- LemasPctOfficDrugUn:,-- PolicBudgPerPop:,-- ViolentCrimesPerPop:
0,8,?,?,Lakewoodcity,1,0.19,0.33,0.02,0.9,0.12,...,0.12,0.26,0.2,0.06,0.04,0.9,0.5,0.32,0.14,0.2
1,53,?,?,Tukwilacity,1,0.0,0.16,0.12,0.74,0.45,...,0.02,0.12,0.45,?,?,?,?,0.0,?,0.67
2,24,?,?,Aberdeentown,1,0.0,0.42,0.49,0.56,0.17,...,0.01,0.21,0.02,?,?,?,?,0.0,?,0.43
3,34,5,81440,Willingborotownship,1,0.04,0.77,1.0,0.08,0.12,...,0.02,0.39,0.28,?,?,?,?,0.0,?,0.12
4,42,95,6096,Bethlehemtownship,1,0.01,0.55,0.02,0.95,0.09,...,0.04,0.09,0.02,?,?,?,?,0.0,?,0.03


The dataset contains socioeconomic variables from communities within the United States, obtained from three sources: the 1990 US Census, the 1990 US LEMAS survey and the 1995 FBI UCR crime data. It has 1994 instances and 128 attributes; the goal attribute is the total number of violent crimes per 100,000 population.

## Dropping unwanted columns
The columns "-- fold:", "-- community:" and "-- communityname:" were dropped. The first one was dropped because it is meant to be used for cross-validation, which is outside the scope of this exercise.
The second and third columns were dropped because they identify individual rows rather than define a category or classification that could add value to the model (in the same way that a person's ID would not be useful in a dataset for pattern recognition).

In [4]:
#Dropping specific columns
df=df.drop(["-- fold:", "-- community:", "-- communityname:"], axis=1)

## Filling missing data
Some attributes contain missing values, which affects the calculation of the covariance matrix. To decide what to do with them, the missing values (marked with '?') are first replaced by NaN, and the number of missing values in each column is then counted.
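As an alternative to replacing the markers after loading, pandas can parse them as NaN while reading. The sketch below is only an illustration under stated assumptions: the three rows and the shortened column names are hypothetical, and `na_values` is used instead of the `replace` call in the notebook's cell.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated, header-less layout.
sample_csv = io.StringIO(
    "8,?,Lakewoodcity,0.19,0.06\n"
    "53,?,Tukwilacity,0.00,?\n"
    "34,5,Willingborotownship,0.04,?\n"
)
cols = ["state", "county", "communityname", "population", "PolicCars"]  # illustrative names

# na_values tells pandas to treat "?" as a missing value while parsing.
sample = pd.read_csv(sample_csv, names=cols, na_values="?")

# Fraction of missing values per column, largest first.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Read this way, columns with missing entries come out as floats instead of strings, and `isnull().mean()` gives the missing share directly, which makes a rule such as "more than half missing" easy to check.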

In [5]:
#Replace the '?' characters with NaN values so they can be counted
df = df.replace(to_replace = '?', value = np.nan)

df.isnull().sum()

-- state:                      0
-- county:                  1174
-- population:                 0
-- householdsize:              0
-- racepctblack:               0
-- racePctWhite:               0
-- racePctAsian:               0
-- racePctHisp:                0
-- agePct12t21:                0
-- agePct12t29:                0
-- agePct16t24:                0
-- agePct65up:                 0
-- numbUrban:                  0
-- pctUrban:                   0
-- medIncome:                  0
-- pctWWage:                   0
-- pctWFarmSelf:               0
-- pctWInvInc:                 0
-- pctWSocSec:                 0
-- pctWPubAsst:                0
-- pctWRetire:                 0
-- medFamInc:                  0
-- perCapInc:                  0
-- whitePerCap:                0
-- blackPerCap:                0
-- indianPerCap:               0
-- AsianPerCap:                0
-- OtherPerCap:                1
-- HispPerCap:                 0
-- NumUnderPov:                0
          

The column "county" is a categorical variable, and 1174 of its 1994 values (about 59%) are missing. This column is dropped, since filling it with the mode would not improve the dataset and predicting this variable is outside the scope of the homework.

The missing values of the numerical columns are replaced by the mean of the corresponding column. Filling in the mean lowers a column's variance, because every imputed value sits exactly at the center of the distribution, so the contribution of heavily imputed columns to the principal components is expected to be low.
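The effect of mean imputation on the variance can be seen on a small example. This is a sketch with a hypothetical column (2 observed values, 8 missing), and it uses pandas `fillna` in place of scikit-learn's imputer to keep it short.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 2 observed values and 8 missing ones.
col = pd.Series([0.1, 0.9] + [np.nan] * 8)

var_observed = col.var()                     # sample variance of the observed values (NaN skipped)
var_imputed = col.fillna(col.mean()).var()   # sample variance after mean imputation

print(var_observed, var_imputed)
```

The sum of squared deviations is unchanged by the imputed values, but it is now divided over ten rows instead of two, so the variance falls from 0.32 to roughly 0.036.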

The "state" column has no missing values, but it has not been normalized. Although it is stored as numbers, it is a nominal variable with no ordinal meaning (no state ranks above another). It should therefore be encoded rather than normalized in the standard way. For the purpose of this homework, the "state" variable is dropped.
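For reference, one common way to encode such a nominal variable is one-hot encoding. The sketch below is not part of the homework's pipeline; the toy DataFrame and its column names are hypothetical, and `pd.get_dummies` is one possible choice among several encoders.

```python
import pandas as pd

# Hypothetical state codes for five communities.
toy = pd.DataFrame({
    "state": [8, 53, 24, 8, 53],
    "population": [0.19, 0.0, 0.0, 0.04, 0.01],
})

# One indicator column per distinct state code; the numeric order of the codes is discarded.
encoded = pd.get_dummies(toy, columns=["state"], prefix="state")
print(encoded.columns.tolist())
```

Each row then has exactly one active indicator, so no artificial ordering between states enters the covariance matrix. The cost is one extra column per state, which is one reason dropping the variable is a reasonable simplification here.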

In [6]:
#Dropping specific columns
df=df.drop(["-- county:", "-- state:" ], axis=1)

In [7]:
from sklearn.impute import SimpleImputer

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform returns a NumPy array, so the column labels are restored afterwards
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Columns: 123 entries, -- population: to -- ViolentCrimesPerPop:
dtypes: float64(123)
memory usage: 1.9 MB


Unnamed: 0,-- population:,-- householdsize:,-- racepctblack:,-- racePctWhite:,-- racePctAsian:,-- racePctHisp:,-- agePct12t21:,-- agePct12t29:,-- agePct16t24:,-- agePct65up:,...,-- LandArea:,-- PopDens:,-- PctUsePubTrans:,-- PolicCars:,-- PolicOperBudg:,-- LemasPctPolicOnPatr:,-- LemasGangUnitDeploy:,-- LemasPctOfficDrugUn:,-- PolicBudgPerPop:,-- ViolentCrimesPerPop:
0,0.19,0.33,0.02,0.90,0.12,0.17,0.34,0.47,0.29,0.32,...,0.12,0.26,0.20,0.060000,0.040000,0.900000,0.500000,0.32,0.140000,0.20
1,0.00,0.16,0.12,0.74,0.45,0.07,0.26,0.59,0.35,0.27,...,0.02,0.12,0.45,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.67
2,0.00,0.42,0.49,0.56,0.17,0.04,0.39,0.47,0.28,0.32,...,0.01,0.21,0.02,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.43
3,0.04,0.77,1.00,0.08,0.12,0.10,0.51,0.50,0.34,0.21,...,0.02,0.39,0.28,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.12
4,0.01,0.55,0.02,0.95,0.09,0.05,0.38,0.38,0.23,0.36,...,0.04,0.09,0.02,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.03
5,0.02,0.28,0.06,0.54,1.00,0.25,0.31,0.48,0.27,0.37,...,0.01,0.58,0.10,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.14
6,0.01,0.39,0.00,0.98,0.06,0.02,0.30,0.37,0.23,0.60,...,0.05,0.08,0.06,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.03
7,0.01,0.74,0.03,0.46,0.20,1.00,0.52,0.55,0.36,0.35,...,0.01,0.33,0.00,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.55
8,0.03,0.34,0.20,0.84,0.02,0.00,0.38,0.45,0.28,0.48,...,0.04,0.17,0.04,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.53
9,0.01,0.40,0.06,0.87,0.30,0.03,0.90,0.82,0.80,0.39,...,0.00,0.47,0.11,0.163103,0.076708,0.698589,0.440439,0.00,0.195078,0.15


## Eigenvectors and Eigenvalues

In [8]:
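# NumPy treats each row as a variable, hence the transpose; df.cov() would not need it
df_cov = np.cov(df.transpose())
# w holds the eigenvalues; column v[:, i] is the eigenvector for w[i]
w, v = np.linalg.eig(df_cov)
# Total variance, used later to express each eigenvalue as a percentage
divide_ = sum(w)
w = pd.DataFrame(w)
w.head(20)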

Unnamed: 0,0
0,1.091788
1,0.760903
2,0.333993
3,0.28748
4,0.187424
5,0.161955
6,0.13352
7,0.111912
8,0.088851
9,0.076813


In [9]:
#"The normalized (unit “length”) eigenvectors, such that the column v[:,i] is the eigenvector corresponding to the eigenvalue w[i]."
v = pd.DataFrame(v)
v.head(15)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,113,114,115,116,117,118,119,120,121,122
0,0.022523,-0.046672,0.071753,-0.033757,0.097118,-0.020894,-0.002693,-0.12121,-0.274178,-0.000647,...,-0.009298,0.02031,-0.017976,-0.013279,0.027252,-0.014353,-0.00321,-0.015885,-0.00366,-0.003775
1,-0.008543,-0.071847,-0.187222,0.076399,0.126941,-0.120064,-0.041485,-0.026724,0.06765,0.06379,...,-0.022805,0.021375,0.05594,0.009485,-0.018689,-0.063827,-0.056007,-0.054234,0.00717,-0.007446
2,0.131474,-0.026898,0.15683,-0.082506,0.169071,-0.27833,0.148856,0.150499,0.145719,-0.063861,...,0.012335,0.003406,-0.008324,-0.009419,-0.007536,-0.008617,0.008366,0.002365,0.008109,-0.0008
3,-0.128525,0.153875,-0.035615,0.049836,-0.123432,0.193112,-0.142137,-0.099005,-0.100596,0.079179,...,0.004053,0.01673,-0.008954,-0.014826,-0.005603,-0.025794,0.04766,-0.001451,0.014974,-0.019165
4,-0.060039,-0.160887,0.021468,0.031787,-0.036723,0.019156,0.02197,0.057344,-0.048577,-0.072849,...,0.00838,-0.000464,-0.004432,0.005746,-0.004179,-0.003507,0.004341,0.021414,-0.002353,0.00016
5,0.065019,-0.185049,-0.18267,0.003067,-0.018489,0.086363,0.03844,-0.126614,-0.020306,-0.003429,...,0.034359,0.006727,-0.007842,-0.001366,-0.014649,-0.009522,0.015405,0.002664,-0.006641,-0.002754
6,0.046326,-0.015091,-0.036498,0.170328,0.048163,-0.139529,-0.148298,-0.048037,0.067977,0.105912,...,0.002664,-0.015499,-0.188762,0.005616,-0.098808,0.174876,-0.042773,0.128966,0.022781,0.024672
7,0.046075,-0.04669,0.022646,0.166044,0.054381,-0.053216,-0.135263,0.025643,0.023608,0.188847,...,0.012417,-0.011695,-0.315301,0.011836,-0.152524,0.290172,-0.068111,0.201874,0.014415,0.075714
8,0.046788,-0.032905,0.043313,0.184066,0.004755,-0.10242,-0.187124,-0.021694,0.063668,0.183075,...,-0.002881,0.020215,0.366755,-0.008022,0.185752,-0.375306,0.076309,-0.247577,-0.035591,-0.078574
9,0.03293,0.06891,0.06654,-0.168354,-0.235168,0.067132,-0.002704,-0.096389,0.064259,-0.129045,...,-0.005402,0.000381,-0.062076,-0.006068,-0.042851,0.037325,-0.022439,0.052291,0.00897,0.020724


The dataset's eigenvalues and eigenvectors are shown above. Each eigenvalue is the variance captured along its eigenvector (principal component). To choose a cut-off, it is useful to express each eigenvalue as a percentage of the total variance and to accumulate those percentages.
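The link between eigenvalues and variance can be checked numerically. The sketch below uses hypothetical random data rather than the crime dataset, and it swaps in `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix, in place of the `np.linalg.eig` call used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples of 4 variables with different scales.
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

cov = np.cov(X, rowvar=False)   # columns are variables, so no transpose is needed

# eigh returns real eigenvalues in ascending order; reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of eigvecs satisfies cov @ v = lambda * v.
assert np.allclose(cov @ eigvecs, eigvecs * eigvals)
# The eigenvalues sum to the total variance (the trace of the covariance matrix).
assert np.isclose(eigvals.sum(), np.trace(cov))

explained = 100 * eigvals / eigvals.sum()
print(explained.round(2))
```

Because the eigenvalues add up to the trace of the covariance matrix, dividing each one by their sum gives the share of the total variance carried by that component.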

In [13]:
# Express each eigenvalue as a percentage of the total variance
result = w.divide(divide_)
result = result.multiply(100)

# Running total of the percentages
accumulative = 0.0
print("\t" + "Eigenvalue" + "\t" + "Accumulative")
for index, value in result.iterrows():
    accumulative = accumulative + value.iloc[0]
    print(index, value.iloc[0], accumulative)


	Eigenvalue	Accumulative
0 25.967158876704467 25.967158876704467
1 18.09736192171685 44.06452079842131
2 7.943721537806585 52.008242336227894
3 6.837433452528955 58.84567578875685
4 4.457707686809089 63.30338347556594
5 3.851944479363393 67.15532795492933
6 3.175659226140869 70.3309871810702
7 2.6617214521803523 72.99270863325056
8 2.113246094861564 75.10595472811212
9 1.8269332914143688 76.93288801952649
10 1.3638992100009661 78.29678722952745
11 1.3262605017780524 79.6230477313055
12 1.280722304294918 80.90377003560042
13 1.1426185740918808 82.0463886096923
14 1.0736115580157974 83.1200001677081
15 0.9363488496151923 84.05634901732329
16 0.7920291007078792 84.84837811803116
17 0.7125977003068319 85.560975818338
18 0.6743969923512696 86.23537281068927
19 0.6481594375533366 86.8835322482426
20 0.6167886605951222 87.50032090883772
21 0.6046915188667421 88.10501242770447
22 0.5817714899599709 88.68678391766444
23 0.5643820101576518 89.2511659278221
24 0.5327574451813432 89.78392337300345

The first three components account for about 52% of the variance, and the first 12 for about 80% (79.6%). From the 16th component onward each one contributes less than 1%, so the gain from every added dimension is minimal.
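The cut-off can also be read off programmatically. The following sketch takes the percentages of the first 13 components, rounded from the output above, and finds the smallest number of components that reaches a given threshold. The helper name `components_needed` is hypothetical.

```python
import numpy as np

# Percentages of variance for the first 13 components, rounded from the printed output.
explained = np.array([25.97, 18.10, 7.94, 6.84, 4.46, 3.85, 3.18,
                      2.66, 2.11, 1.83, 1.36, 1.33, 1.28])

cumulative = np.cumsum(explained)

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    if threshold > cumulative[-1]:
        raise ValueError("threshold is not reached by the listed components")
    # searchsorted returns the first index with cumulative >= threshold (0-based).
    return int(np.searchsorted(cumulative, threshold) + 1)

print(components_needed(cumulative, 50))
print(components_needed(cumulative, 80))
```

With these numbers, three components are enough to pass 50%. Thirteen are needed to strictly pass 80%, since twelve components stop at about 79.6%.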


## Bibliography

[1] 	University of California, Irvine, "UCI Machine Learning Repository: Communities and Crime Data Set," [Online]. Available: http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data. [Accessed 22 January 2019].

[2] 	University of California, Irvine, "UCI Machine Learning Repository: Communities and Crime Data Set Documentation," [Online]. Available: http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.names. [Accessed 22 January 2019].

[3] 	W. McKinney, "pandas: powerful Python data analysis toolkit," 06 08 2018. [Online]. Available: http://pandas.pydata.org/pandas-docs/stable/. [Accessed 09 January 2019].

[4] 	F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

[5] 	DataCamp, "Data Scientist with Python Track | DataCamp," [Online]. Available: https://www.datacamp.com/tracks/data-scientist-with-python. [Accessed 09 01 2019].

[6] 	Stack Overflow, "In Python, given a URL to a text file, what is the simplest way to read the contents of the text file? - Stack Overflow," Stack Exchange, 22 01 2019. [Online]. Available: https://stackoverflow.com/a/1393367. [Accessed 22 01 2019].

[7] 	Stack Overflow, "How do I convert a Python 3 byte-string variable into a regular string?," Stack Exchange, 25 06 2015. [Online]. Available: https://stackoverflow.com/a/31060836. [Accessed 22 01 2019].

[8] 	B. Welsh, "python recipe: read file, find pattern, print matches. Palewire," Palewire, 14 04 2008. [Online]. Available: https://palewi.re/posts/2008/04/14/python-recipe-read-a-file-search-for-a-pattern-print-your-matches/. [Accessed 22 01 2019].


