# Data analysis project on Income data set 

## Project Overview

The main task is to carry out an exploratory data analysis and apply some of the methodologies I have learned. The data set I will be working on is about incomes.

## Importing libraries

In [107]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats


## Understanding the dataset

### Loading

Loading the dataset.

In [86]:
# From csv file to dataframe
raw_df = pd.read_csv("/Users/emmatosato/Documents/UNI Locale/Erasmus/Statistical Computing/StatisticalComputing_Project/data50_synth.csv")

### Data set information

Retrieving some information about the data set.

In [87]:
# Number of rows and columns
raw_df.shape

(10000, 52)

In [88]:
# Information about the data set
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 52 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      10000 non-null  int64 
 1   income_level    10000 non-null  int64 
 2   sex             10000 non-null  object
 3   age             10000 non-null  int64 
 4   town            10000 non-null  object
 5   occupation      10000 non-null  object
 6   height          10000 non-null  int64 
 7   1sibling        10000 non-null  int64 
 8   2home           10000 non-null  int64 
 9   3educ           10000 non-null  int64 
 10  4travel         10000 non-null  int64 
 11  5credit         10000 non-null  int64 
 12  6cr sum         10000 non-null  int64 
 13  7card           10000 non-null  int64 
 14  8pastime        10000 non-null  int64 
 15  9changes        10000 non-null  int64 
 16  10daily travel  10000 non-null  int64 
 17  11travel mode   10000 non-null  int64 
 18  12consc

The main point to highlight is that the majority of the variables (49 of 52) are integer-valued (numeric and dummy variables).
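The dtype breakdown can also be obtained directly instead of being read off the `info()` listing. The following is a minimal sketch on a hypothetical toy frame (the values are made up; only the column names are borrowed from the data set):

```python
import pandas as pd

# Hypothetical toy frame with a few columns named as in the data set.
toy = pd.DataFrame({
    "income_level": [1, 0, 1],
    "sex": ["female", "male", "male"],
    "age": [60, 40, 49],
    "town": ["Nagykanizsa", "Budakeszi", "Nagykanizsa"],
})

# How many columns share each dtype.
print(toy.dtypes.value_counts())

# Split the columns into numeric and non-numeric ones.
n_numeric = toy.select_dtypes(include="number").shape[1]
n_other = toy.select_dtypes(exclude="number").shape[1]
print(n_numeric, n_other)
```

Applied to `raw_df`, the same two calls should reproduce the 49 versus 3 split mentioned above.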

In [89]:
# Column names
raw_df.columns

Index(['Unnamed: 0', 'income_level', 'sex', 'age', 'town', 'occupation',
       'height', '1sibling', '2home', '3educ', '4travel', '5credit', '6cr sum',
       '7card', '8pastime', '9changes', '10daily travel', '11travel mode',
       '12conscious', '13accounts', '14sum', '15mobile', '16mobtime', '17car',
       '18internet', '19sport', '20vegan', '21civil', '22polit', '23ill',
       '24medicine', '25doctor', '26covid', '27advert', '28culture', '29food',
       '30children', '31rooms', '32valuables', '33cloths', '34jewels',
       '35smoke', '36alcohol', '37music', '38face', '39twitter', '40online1',
       '41online2', '42TV', '43refurb', '44move', '45travdest'],
      dtype='object')

In [90]:
# First 5 rows 
raw_df.head(5)

,Unnamed: 0,income_level,sex,age,town,occupation,height,1sibling,2home,3educ,...,36alcohol,37music,38face,39twitter,40online1,41online2,42TV,43refurb,44move,45travdest
0,0,1,female,60,Nagykanizsa,Szakorvos,168,3,0,4,...,1,2,1,0,2,2,0,1,1,1
1,1,1,male,40,Nagykanizsa,Lakatos,170,4,0,3,...,1,0,0,0,1,2,1,1,1,1
2,2,0,male,49,Budakeszi,asztalos,196,3,0,1,...,3,2,0,0,0,0,1,0,1,1
3,3,0,male,47,Nagykanizsa,Szakorvos,166,0,0,3,...,3,1,1,0,2,0,1,1,0,1
4,4,0,female,41,Budakeszi,asztalos,173,3,0,3,...,1,2,1,0,0,0,1,1,0,1


In [91]:
# Last 5 rows 
raw_df.tail(5)

,Unnamed: 0,income_level,sex,age,town,occupation,height,1sibling,2home,3educ,...,36alcohol,37music,38face,39twitter,40online1,41online2,42TV,43refurb,44move,45travdest
9995,9995,0,female,44,Budakeszi,Lakatos,168,0,2,1,...,1,2,1,0,1,0,0,0,0,1
9996,9996,0,male,54,Nagykanizsa,Lakatos,172,0,0,4,...,1,2,0,0,1,2,0,0,0,1
9997,9997,1,female,62,Nagykanizsa,Lakatos,192,4,0,3,...,0,2,0,0,2,2,1,1,1,4
9998,9998,1,male,56,Budakeszi,Szakorvos,164,0,0,2,...,1,2,0,0,1,2,0,1,0,2
9999,9999,0,male,58,Nagykanizsa,Lakatos,171,3,0,0,...,1,2,0,0,1,0,0,0,0,3


### Columns meaning

In [92]:
# number of unique values for each variable
raw_df.nunique(axis=0)

Unnamed: 0        10000
income_level          2
sex                   2
age                  41
town                  3
occupation            3
height               60
1sibling              5
2home                 3
3educ                 5
4travel               2
5credit               2
6cr sum               4
7card                 2
8pastime             12
9changes              4
10daily travel       31
11travel mode         5
12conscious           2
13accounts            2
14sum              5398
15mobile              2
16mobtime             2
17car                 2
18internet            2
19sport               2
20vegan               2
21civil               2
22polit               2
23ill                 2
24medicine            2
25doctor              6
26covid               4
27advert              2
28culture             5
29food                5
30children            5
31rooms               5
32valuables        2769
33cloths              4
34jewels              4
35smoke         

We can see that most of the variables have between 2 and 5 unique values, while only a few are characterized by a very large variety. Below we explore these in more detail.

In [125]:
# Categorical variables
categorical_vars = raw_df.select_dtypes(include=['object']).columns.tolist()
print(categorical_vars)

['sex', 'town', 'occupation']


In [94]:
# Searching the binary variables 
binary_vars = []

for column in raw_df.columns:
    unique_values = raw_df[column].unique()
    if len(unique_values) == 2:
        binary_vars.append(column)

print(binary_vars)

['income_level', 'sex', '4travel', '5credit', '7card', '12conscious', '13accounts', '15mobile', '16mobtime', '17car', '18internet', '19sport', '20vegan', '21civil', '22polit', '23ill', '24medicine', '27advert', '35smoke', '38face', '39twitter', '43refurb', '44move']


In [95]:
# Searching the NON binary variables 
NON_binary_vars = []

for column in raw_df.columns:
    unique_values = raw_df[column].unique()
    if len(unique_values) > 2:
        NON_binary_vars.append(column)

print(NON_binary_vars)

['Unnamed: 0', 'age', 'town', 'occupation', 'height', '1sibling', '2home', '3educ', '6cr sum', '8pastime', '9changes', '10daily travel', '11travel mode', '14sum', '25doctor', '26covid', '28culture', '29food', '30children', '31rooms', '32valuables', '33cloths', '34jewels', '36alcohol', '37music', '40online1', '41online2', '42TV', '45travdest']
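The two loops above can also be written without explicit iteration. This is only a sketch on a hypothetical toy frame; note that `nunique()` ignores missing values while `unique()` counts them, which makes no difference when there are no missing values.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "income_level": [1, 0, 1, 0],
    "sex": ["female", "male", "male", "female"],
    "age": [60, 40, 49, 47],
    "3educ": [4, 3, 1, 3],
})

# Number of distinct values per column, computed once.
n_unique = toy.nunique()

# Columns with exactly two distinct values, and columns with more than two.
binary_cols = n_unique[n_unique == 2].index.tolist()
non_binary_cols = n_unique[n_unique > 2].index.tolist()

print(binary_cols)
print(non_binary_cols)
```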


In [96]:
# Understanding the max values for the non binary and non categorical variables
# A non-inplace drop on the selection avoids pandas' SettingWithCopyWarning
non_binary_df = raw_df[NON_binary_vars].drop(columns=['Unnamed: 0', 'town', 'occupation'])
non_binary_df.max()


age                     65
height                 204
1sibling                 4
2home                    2
3educ                    4
6cr sum                  3
8pastime                14
9changes                 3
10daily travel          30
11travel mode            4
14sum             56122200
25doctor                 5
26covid                  3
28culture                4
29food                   4
30children               4
31rooms                  4
32valuables         164676
33cloths                 3
34jewels                 3
36alcohol                3
37music                  3
40online1                2
41online2                2
42TV                     2
45travdest               4
dtype: int64

Taking into account all the outputs above, we can deduce the following about the variables:
- sex, age, town, occupation and height are the usual demographic features.

- income_level, 4travel, 5credit, 7card, 12conscious, 13accounts, 15mobile, 16mobtime, 17car, 18internet, 19sport, 20vegan, 21civil, 22polit, 23ill, 24medicine, 27advert, 35smoke, 38face, 39twitter, 43refurb and 44move are all binary variables (0 or 1). These could indicate the presence (1) or absence (0) of a particular characteristic or behavior.

- 3educ might be the education level.

- 14sum appears to be a sum of some quantity; with a maximum of 56,122,200 it is by far the widest-ranging variable.

- 32valuables seems to be a quantitative measure of valuable possessions.

- 1sibling, 2home, 6cr sum, 8pastime, 9changes, 10daily travel, 11travel mode, 25doctor, 26covid, 28culture, 29food, 30children, 31rooms, 33cloths, 34jewels, 36alcohol, 37music, 40online1, 41online2, 42TV, 45travdest could be either
    - a count of a particular characteristic or behavior (such as the number of siblings, homes and so on), or
    
    - an encoding of different categories of that feature.
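Frequency tables can help judge whether a low-cardinality integer column behaves like a count or like a category code, although they cannot settle the question on their own. Below is a rough sketch on a hypothetical toy frame; the cutoff of 5 distinct values is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "1sibling": [3, 4, 3, 0, 3],
    "11travel mode": [0, 1, 1, 4, 0],
})

# One frequency table per low-cardinality column, ordered by value.
freq_tables = {
    col: toy[col].value_counts().sort_index()
    for col in toy.columns
    if toy[col].nunique() <= 5  # assumed cutoff for "few distinct values"
}

for col, freq in freq_tables.items():
    print(col)
    print(freq)
```

Gaps in the observed values (such as 2 and 3 never occurring) or very uneven frequencies may hint at category codes rather than counts.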


### Some statistics

To make the statistics easier to read, I will split the variables into subsets.

In [97]:
usual_df = raw_df[['income_level', 'sex', 'age', 'town', 'occupation', 'height']]
# Binary variables, excluding the two already covered in usual_df
binary_df = raw_df[binary_vars].drop(['income_level', 'sex'], axis=1)
non_binary_df = non_binary_df.drop(['age', 'height'], axis=1)

In [106]:
usual_df.describe()

,income_level,age,height
count,10000.0,10000.0,10000.0
mean,0.5623,44.549,169.9775
std,0.496128,12.350281,9.746543
min,0.0,25.0,145.0
25%,0.0,33.0,164.0
50%,1.0,45.0,170.0
75%,1.0,55.0,176.0
max,1.0,65.0,204.0


In [104]:
binary_df.describe()

,4travel,5credit,7card,12conscious,13accounts,15mobile,16mobtime,17car,18internet,19sport,...,21civil,22polit,23ill,24medicine,27advert,35smoke,38face,39twitter,43refurb,44move
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,0.7175,0.9676,0.9245,0.0168,0.4196,0.7437,0.1266,0.3259,0.1751,0.8104,...,0.3595,0.7794,0.7618,0.4064,0.1121,0.9312,0.5262,0.3622,0.6242,0.4097
std,0.450238,0.177069,0.26421,0.128528,0.493518,0.436611,0.332541,0.468734,0.380072,0.392004,...,0.479878,0.414672,0.426003,0.491185,0.315505,0.253126,0.499338,0.48066,0.484353,0.491803
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
75%,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,...,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [105]:
non_binary_df.describe()

,1sibling,2home,3educ,6cr sum,8pastime,9changes,10daily travel,11travel mode,14sum,25doctor,...,31rooms,32valuables,33cloths,34jewels,36alcohol,37music,40online1,41online2,42TV,45travdest
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,2.2782,0.3547,2.7505,1.1622,1.7529,1.9869,2.4031,1.1617,996820.3,2.3436,...,1.915,2108.8754,2.0387,1.3458,1.7817,1.4178,1.0686,0.8927,0.5157,1.6087
std,1.630604,0.716756,1.039403,0.466168,1.498354,0.910061,2.832951,1.432256,2859354.0,1.860399,...,1.615499,9411.498664,0.706011,1.096787,0.962464,0.783903,0.653098,0.930308,0.549712,1.134422
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,2.0,1.0,1.0,2.0,1.0,0.0,0.0,1.0,...,0.0,0.0,2.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0
50%,3.0,0.0,3.0,1.0,1.0,2.0,2.0,1.0,184235.0,2.0,...,2.0,299.0,2.0,1.0,2.0,2.0,1.0,1.0,0.0,1.0
75%,4.0,0.0,3.0,1.0,2.0,3.0,3.0,1.0,1355805.0,5.0,...,3.0,1147.0,2.0,2.0,3.0,2.0,1.0,2.0,1.0,3.0
max,4.0,2.0,4.0,3.0,14.0,3.0,30.0,4.0,56122200.0,5.0,...,4.0,164676.0,3.0,3.0,3.0,3.0,2.0,2.0,2.0,4.0


## Cleaning the data set

### Redundant variables

I dropped one variable that I consider redundant: the "Unnamed: 0" column, which simply holds the index of each observation. Since the DataFrame already maintains an index for its rows, a column that duplicates it adds no information.

In [99]:
# Dropping column
df = raw_df.drop(['Unnamed: 0'], axis=1)
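Before dropping such a column it may be worth confirming that it really duplicates the row index. The sketch below uses a small in-memory CSV as a hypothetical stand-in for the real file, and also shows `index_col=0` as an alternative that avoids creating the column in the first place.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the first, unnamed column is a saved row index.
csv_text = ",income_level,age\n0,1,60\n1,1,40\n2,0,49\n"
toy_raw = pd.read_csv(io.StringIO(csv_text))

# Check that the column merely repeats the row index before dropping it.
is_redundant = bool((toy_raw["Unnamed: 0"] == toy_raw.index).all())

# Drop it only when the check passes.
toy_df = toy_raw.drop(columns=["Unnamed: 0"]) if is_redundant else toy_raw

# Alternative: read the first column as the index straight away.
toy_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(is_redundant)
print(toy_df.columns.tolist())
print(toy_indexed.columns.tolist())
```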

### Renaming columns

Renaming columns for clarity.

In [100]:
# lstrip removes leading digits
df.columns = [col.lstrip('0123456789') for col in df.columns]
df.columns

Index(['income_level', 'sex', 'age', 'town', 'occupation', 'height', 'sibling',
       'home', 'educ', 'travel', 'credit', 'cr sum', 'card', 'pastime',
       'changes', 'daily travel', 'travel mode', 'conscious', 'accounts',
       'sum', 'mobile', 'mobtime', 'car', 'internet', 'sport', 'vegan',
       'civil', 'polit', 'ill', 'medicine', 'doctor', 'covid', 'advert',
       'culture', 'food', 'children', 'rooms', 'valuables', 'cloths', 'jewels',
       'smoke', 'alcohol', 'music', 'face', 'twitter', 'online1', 'online2',
       'TV', 'refurb', 'move', 'travdest'],
      dtype='object')
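Stripping a prefix can in principle make two different names collide, so a uniqueness check after renaming is a cheap safeguard. The sketch below works on a handful of the original column names and uses a regular expression, which here behaves the same as `lstrip('0123456789')`.

```python
import re

# A few of the original column names.
original = ["income_level", "1sibling", "6cr sum", "10daily travel", "45travdest"]

# Remove leading digits only; the rest of each name is untouched.
renamed = [re.sub(r"^\d+", "", name) for name in original]

# Detect collisions introduced by the renaming.
has_duplicates = len(set(renamed)) != len(renamed)

print(renamed)
print(has_duplicates)
```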

### Missing values

In [101]:
# Checking for missing values
df[df.isnull().any(axis=1)]

,income_level,sex,age,town,occupation,height,sibling,home,educ,travel,...,alcohol,music,face,twitter,online1,online2,TV,refurb,move,travdest


There are no missing values in the dataset.
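A per-column count is a compact complement to listing the affected rows. This is a minimal sketch on a hypothetical toy frame that contains one missing value on purpose:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing age.
toy = pd.DataFrame({
    "age": [60, np.nan, 49],
    "height": [168, 170, 196],
})

# Missing values per column and in total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

# Rows that contain at least one missing value (same expression as above).
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(total_missing)
```

For a data set without missing values, `total_missing` is 0 and `rows_with_missing` is empty.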

### Outliers

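The summary statistics give a first impression of possible outliers, but a more formal check is useful. Here I use z-scores.

In [130]:
categorical_vars

['sex', 'town', 'occupation']

In [132]:
non_categorical = df.drop(columns=categorical_vars)

# Z-scores for every numeric column (computed column-wise)
z_scores = pd.DataFrame(stats.zscore(non_categorical),
                        columns=non_categorical.columns,
                        index=non_categorical.index)

# Flag values more than 3.5 standard deviations away from their column mean
outliers = z_scores.abs() > 3.5

# Number of flagged values per column
outliers.sum()

A z-score threshold of 3.5 is fairly strict, so only quite extreme points are flagged. Even so, the summary statistics imply that some values will be: the maximum of sum (56,122,200) lies about 19 standard deviations above its mean, and the maximum of valuables (164,676) about 17. For binary and coded variables a high z-score only indicates a rare category (for example conscious, whose mean is 0.0168), so the counts are mainly informative for continuous variables such as age, height, sum and valuables.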
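One pitfall when inspecting a boolean outlier mask is worth illustrating. Indexing a DataFrame with a boolean DataFrame turns every non-flagged cell into NaN, and `value_counts()` then drops every row that contains a NaN, so the result can be empty even when outliers exist. The sketch below uses a hypothetical, deterministic toy frame with one planted extreme value; counting `True` values per column is shown as one possible alternative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: heights between 160 and 180 plus a 0/1 flag.
toy = pd.DataFrame({
    "height": 170 + (np.arange(200) % 21 - 10),
    "flag": np.arange(200) % 2,
})
toy.loc[0, "height"] = 400  # planted extreme value

# Column-wise z-scores, wrapped so the result is a labelled DataFrame.
z = pd.DataFrame(stats.zscore(toy), columns=toy.columns, index=toy.index)
mask = z.abs() > 3.5

# Pitfall: rows containing any NaN are dropped, so nothing is reported.
masked_counts = toy[mask].value_counts()

# Alternative: count the flagged cells in each column.
per_column = mask.sum()

print(len(masked_counts))
print(per_column)
```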

## Analyzing relationships between variables