# TDA Project

The dataset used can be found at the following link: https://archive.ics.uci.edu/ml/datasets/adult

The first step consists of loading the data and defining an adequate distance space.
The distance should be constructed as a pseudometric, i.e. in particular it must satisfy the triangle inequality.
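
For reference, the standard requirements on a pseudometric can be written as follows. The symbols $d$, $X$, $x$, $y$ and $z$ are notation introduced here for illustration; they do not come from the dataset.

$$
\begin{aligned}
d(x, y) &\ge 0, \\
d(x, x) &= 0, \\
d(x, y) &= d(y, x), \\
d(x, z) &\le d(x, y) + d(y, z),
\end{aligned}
\qquad \text{for all } x, y, z \in X.
$$

Unlike a metric, a pseudometric allows $d(x, y) = 0$ for $x \neq y$, which is what happens when two different records are mapped to the same numerical values.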

Note some particularities of the data:
- The data was collected in 1996, so the GDP data used must refer to the same year.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

## Loading and preprocessing the data

The objective of this part of the code is to load the dataset and preprocess it. The output must be a dataframe with only numerical variables; the distance is then computed as the Euclidean norm of the difference between two records. How each variable is mapped to a number is the subject of this section.

In [3]:
# Load dataset and define columns
# engine="python" is required because the separator has more than one character
df = pd.read_csv("adult.data", header=None, sep=", ", engine="python")
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]

  


In [4]:
df.head()

Unnamed: 0,Age,WorkClass,fnlwgt,Education,EducationNum,MaritalStatus,Occupation,Relationship,Race,Gender,CapitalGain,CapitalLoss,HoursPerWeek,NativeCountry,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In order to define a distance between two elements of the dataset we have to choose the variables taken into account. 

A distance must be defined for each of these variables. For continuous variables such as _Age_, _fnlwgt_ or _HoursPerWeek_, the absolute difference seems like the natural choice.

Categorical variables, on the other hand, are more complicated. For variables like _Gender_ we can choose distance 0 if two records have the same gender and distance 1 otherwise. Other variables, like the native country, can be mapped to a number that reflects how dissimilar the categories are; in this case we decided to map each country to its GDP per capita.
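
As a minimal sketch of the two kinds of per-variable distance described above, the snippet below uses a tiny hand-made frame (not the real dataset) and a hypothetical `gender_code` mapping. The 0/1 encoding is only one possible choice; under it, the absolute difference reproduces the 0/1 distance for _Gender_.

```python
import pandas as pd

# Toy records for illustration only; this is not the loaded dataset
toy = pd.DataFrame({"Age": [39, 50, 28], "Gender": ["Male", "Male", "Female"]})

# Hypothetical encoding: any two distinct numbers one unit apart would work
gender_code = {"Male": 0, "Female": 1}
toy["GenderNum"] = toy["Gender"].map(gender_code)

def abs_diff(series, i, j):
    """Absolute difference between records i and j for one numerical column."""
    return abs(series.iloc[i] - series.iloc[j])

age_dist = abs_diff(toy["Age"], 0, 1)            # continuous variable
same_gender = abs_diff(toy["GenderNum"], 0, 1)   # both records are Male
diff_gender = abs_diff(toy["GenderNum"], 0, 2)   # Male versus Female
print(age_dist, same_gender, diff_gender)
```

Encoding the category first keeps every column numerical, which is what the Euclidean norm over whole records requires.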

Check the distinct values of the categorical variables:

In [5]:
df.Occupation.unique()

array(['Adm-clerical', 'Exec-managerial', 'Handlers-cleaners',
       'Prof-specialty', 'Other-service', 'Sales', 'Craft-repair',
       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',
       'Tech-support', '?', 'Protective-serv', 'Armed-Forces',
       'Priv-house-serv'], dtype=object)

In [6]:
df.WorkClass.unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

In [7]:
df.Race.unique()

array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
       'Other'], dtype=object)

Import the GDP data, obtained from https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD?view=chart

Check that the country names match those in the original dataset so that the inner join works properly.

In [8]:
GDP_all = pd.read_csv('GDP_percapita.csv', header = 2)
GDP_1996 = GDP_all[['Country Name', '1996']]
GDP_1996 = GDP_1996.rename(columns={'Country Name':'NativeCountry'})

In [51]:
df_gdp = pd.merge(df, GDP_1996, on='NativeCountry', how='inner')

In [54]:
df_gdp.NativeCountry.unique()

array(['Cuba', 'Jamaica', 'India', 'Mexico', 'Honduras', 'Canada',
       'Germany', 'Philippines', 'Italy', 'Poland', 'Cambodia',
       'Thailand', 'Ecuador', 'Haiti', 'Portugal', 'France', 'Guatemala',
       'China', 'Japan', 'Peru', 'Greece', 'Nicaragua', 'Vietnam',
       'Ireland', 'Hungary'], dtype=object)
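
The merged frame above no longer contains `United-States`, which suggests that some names are spelled differently in the two sources and are silently dropped by the inner join. One way to find such cases, sketched here on toy frames with placeholder values (not real GDP figures), is a left merge with `indicator=True`. Replacing hyphens with spaces is an assumed normalization and may not cover every country.

```python
import pandas as pd

# Toy stand-ins for df and GDP_1996; the numbers are placeholders, not real GDP data
people = pd.DataFrame({"NativeCountry": ["United-States", "Cuba", "?", "United-States"]})
gdp = pd.DataFrame({"NativeCountry": ["United States", "Cuba"], "1996": [1.0, 2.0]})

# A left merge keeps every record; the _merge column marks rows without a partner
check = pd.merge(people, gdp, on="NativeCountry", how="left", indicator=True)
unmatched = sorted(check.loc[check["_merge"] == "left_only", "NativeCountry"].unique())
print(unmatched)

# Assumed normalization: replace hyphens with spaces before joining
people["NativeCountry"] = people["NativeCountry"].str.replace("-", " ", regex=False)
check = pd.merge(people, gdp, on="NativeCountry", how="left", indicator=True)
still_unmatched = sorted(check.loc[check["_merge"] == "left_only", "NativeCountry"].unique())
print(still_unmatched)
```

Names that remain unmatched after normalization, such as the `?` placeholder for missing values, need an explicit decision: drop the record or assign a value by hand.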

Transform different variables

In [10]:
df.MaritalStatus.unique()

array(['Never-married', 'Married-civ-spouse', 'Divorced',
       'Married-spouse-absent', 'Separated', 'Married-AF-spouse',
       'Widowed'], dtype=object)

In [11]:
df.Relationship.unique()

array(['Not-in-family', 'Husband', 'Wife', 'Own-child', 'Unmarried',
       'Other-relative'], dtype=object)

In [16]:
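# Combine gains and losses into a single net value
df['NetCapital'] = df.CapitalGain - df.CapitalLoss
# Education is redundant with EducationNum; pass columns= explicitly (positional axis is not accepted by recent pandas)
df = df.drop(columns=['Education', 'CapitalGain', 'CapitalLoss'])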

In [13]:
df.head(50)

Unnamed: 0,Age,WorkClass,fnlwgt,Education,EducationNum,MaritalStatus,Occupation,Relationship,Race,Gender,CapitalGain,CapitalLoss,HoursPerWeek,NativeCountry,Income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


## Distance calculation

In [17]:
data_processed = df[["Age", "EducationNum", "NetCapital", "HoursPerWeek"]]
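
A minimal sketch of the distance computation, assuming the Euclidean norm described earlier is applied to the selected columns. It runs on a few hand-made rows rather than on `data_processed`. The standardization step is an assumption added here so that a column with a large range, such as _NetCapital_, does not dominate the result; the function name `pairwise_euclidean` is hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the matrix D with D[i, j] = ||X[i] - X[j]||_2."""
    diff = X[:, None, :] - X[None, :, :]   # shape (n, n, d) by broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy rows in the order Age, EducationNum, NetCapital, HoursPerWeek
X = np.array([
    [39.0, 13.0, 2174.0, 40.0],
    [50.0, 13.0,    0.0, 13.0],
    [38.0,  9.0,    0.0, 40.0],
])

# Assumed preprocessing: rescale each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

D = pairwise_euclidean(X_std)
print(D.round(3))
```

The intermediate array has size n x n x d, so memory grows quadratically with the number of rows; for the full dataset a chunked computation or a library routine may be preferable.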