<a href="https://colab.research.google.com/github/simoneSantoni/people-analytics-smm639/blob/main/lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SMM639 - People Analytics, Week 10 lab & coursework



---

__Synopsis:__ This notebook expands on the EDA reported in [EDA_Adult_Data_EDA.Rmd](https://github.com/simoneSantoni/people-analytics-smm639/blob/main/session1/EDA_Adult_Data_EDA.Rmd) to:

*   read & clean the relevant data using Python
*   introduce the coursework  

---
__Info:__

_Authors:_ Vali Asimit [Alexandru.Asimit.1@city.ac.uk] & Simone Santoni [simone.santoni.1@city.ac.uk]

_History:_ created on Mon Nov  30 07:44:17

# Setup

In [2]:
# libraries
import os
import glob
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load data

## UCI Data

### List files

In [5]:
# collect data
# --+ target dataset
file_names = ['adult.data', 'adult.names', 'adult.test']
# --+ base url
base_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/'
# --+ assemble urls
file_locations = ['{}{}'.format(base_url, i) for i in file_names]
in_files = dict(zip(file_names, file_locations))

In [4]:
# collection of files to iterate
in_files

{'adult.data': 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
 'adult.names': 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names',
 'adult.test': 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'}

### Create a 'read-and-clean' pipeline

In [None]:
def read_and_clean(target_, nan_):
    """
    params
    ------
    target_ : a string, name of the target file 
    nan_    : a string, the symbol used to encode NaNs

    return
    ------
    a Pandas DF without NANs

    """
    # file location
    url = in_files[target_]
    # read data
    # --+ the first row of 'adult.test' is a comment-like line, so skip it
    if 'test' not in target_:
        df = pd.read_csv(url, header=None)
    else:
        df = pd.read_csv(url, header=None, skiprows=1)
    # assign names to columns
    new_cols = ['X{}'.format(i) for i in range(1, 15)]
    new_cols.append('Y')
    old_cols = df.columns
    df.rename(columns=dict(zip(old_cols, new_cols)), inplace=True)
    # deal with NANs
    # --+ replace '?' with NANs
    for col in df.columns:
        # only string (object) columns can contain the NaN marker
        if df[col].dtype == 'object':
            # regex=False matches nan_ literally, so no escaping is needed
            df.loc[df[col].str.contains(nan_, regex=False), col] = np.nan
    # --+ drop NANs listwise
    df.dropna(inplace=True)
    # gimme the output!
    return df
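
As an aside, the same '?'-to-NaN replacement can be sketched at read time with `pd.read_csv`'s `na_values` argument. The rows below are made-up values in the layout of 'adult.data', truncated to five columns for brevity; they are an assumption for illustration, not the real file. Note that `na_values` matches whole cells, whereas `read_and_clean` flags any cell that contains the marker.

```python
import io
import pandas as pd

# Toy rows mimicking the 'adult.data' layout (hypothetical values, 5 columns only)
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, <=50K\n"
    "50, ?, 83311, Bachelors, <=50K\n"
    "38, Private, 215646, HS-grad, >50K\n"
)

# skipinitialspace drops the blank after each comma, so ' ?' is read as '?'
df = pd.read_csv(sample, header=None, na_values="?", skipinitialspace=True)
df.columns = ["X1", "X2", "X3", "X4", "Y"]

# listwise deletion, as in read_and_clean
clean = df.dropna()
print(clean.shape)
```

Whether this is preferable depends on the data: it also strips the leading blank from every string field, which `read_and_clean` leaves in place.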



### Read data and get info

#### Train

In [None]:
# read and clean
train = read_and_clean(target_='adult.data', nan_='?')
# get info
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   X1      30162 non-null  int64 
 1   X2      30162 non-null  object
 2   X3      30162 non-null  int64 
 3   X4      30162 non-null  object
 4   X5      30162 non-null  int64 
 5   X6      30162 non-null  object
 6   X7      30162 non-null  object
 7   X8      30162 non-null  object
 8   X9      30162 non-null  object
 9   X10     30162 non-null  object
 10  X11     30162 non-null  int64 
 11  X12     30162 non-null  int64 
 12  X13     30162 non-null  int64 
 13  X14     30162 non-null  object
 14  Y       30162 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


#### Test 

In [None]:
# read and clean
test = read_and_clean(target_='adult.test', nan_='?')
# get info
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15060 entries, 0 to 16280
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   X1      15060 non-null  int64 
 1   X2      15060 non-null  object
 2   X3      15060 non-null  int64 
 3   X4      15060 non-null  object
 4   X5      15060 non-null  int64 
 5   X6      15060 non-null  object
 6   X7      15060 non-null  object
 7   X8      15060 non-null  object
 8   X9      15060 non-null  object
 9   X10     15060 non-null  object
 10  X11     15060 non-null  int64 
 11  X12     15060 non-null  int64 
 12  X13     15060 non-null  int64 
 13  X14     15060 non-null  object
 14  Y       15060 non-null  object
dtypes: int64(6), object(9)
memory usage: 1.8+ MB


## Data with c-SVM predictions

In [14]:
# url pointing to the GitHub repo
url = 'https://raw.githubusercontent.com/simoneSantoni/'\
      'people-analytics-smm639/main/session1/cSVM.json'
pr = pd.read_json(url)
# get info
pr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15060 entries, 0 to 15059
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Y                 15060 non-null  int64 
 1   X1                15060 non-null  int64 
 2   X2                15060 non-null  object
 3   X3                15060 non-null  int64 
 4   X4                15060 non-null  object
 5   X5                15060 non-null  int64 
 6   X6                15060 non-null  object
 7   X7                15060 non-null  object
 8   X8                15060 non-null  object
 9   X9                15060 non-null  object
 10  X10               15060 non-null  object
 11  X11               15060 non-null  int64 
 12  X12               15060 non-null  int64 
 13  X13               15060 non-null  int64 
 14  prediction_C_SVM  15060 non-null  int64 
dtypes: int64(8), object(7)
memory usage: 1.8+ MB


# Coursework: visualization of misclassified cases

---
__Context__:

+ focus on misclassified cases – i.e., cases for which $\hat{y}_{i} = 1$ and $y_{i} = -1$
+ design and implement a Bokeh dashboard that helps a business analyst:
  * to understand how misclassified cases map onto the features $X1 - X13$
  * to understand how misclassified cases relate to cases that are correctly classified
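
A possible first step is to flag the misclassified cases before any plotting. The sketch below uses a small made-up data frame as a stand-in for `pr`; it assumes that `Y` and `prediction_C_SVM` are coded as -1/+1, as the definition above suggests, which should be checked against the actual data.

```python
import pandas as pd

# Toy stand-in for `pr` (hypothetical values; -1/+1 coding is an assumption)
toy = pd.DataFrame({
    "Y": [-1, -1, 1, 1, -1],
    "prediction_C_SVM": [1, -1, 1, -1, -1],
})

# misclassified as defined above: predicted 1, actual -1
toy["misclassified"] = (toy["prediction_C_SVM"] == 1) & (toy["Y"] == -1)

# cross-tabulate actual (rows) against predicted (columns) labels
confusion = pd.crosstab(toy["Y"], toy["prediction_C_SVM"])
print(confusion)
```

The boolean `misclassified` column can then drive colors or filters in the dashboard.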

---
__Deliverables__:

+ an executive summary of one page (circa 400 words) illustrating the key insights emerging from the dashboard
+ the Python code behind the Bokeh dashboard
+ an .html file containing the dashboard
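
To illustrate the last deliverable, here is a minimal sketch of how a Bokeh figure might be written to a standalone .html file. The data, the choice of `X1` and `X13` as axes, the colors, the `status` column, and the file name `dashboard.html` are all assumptions for illustration; a real dashboard would combine several linked plots and widgets.

```python
import pandas as pd
from bokeh.embed import file_html
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.resources import CDN

# Toy data (hypothetical values); 'status' is an illustrative helper column
toy = pd.DataFrame({
    "X1": [25, 38, 52, 44],
    "X13": [40, 50, 45, 60],
    "status": ["correct", "misclassified", "correct", "misclassified"],
})
colors = {"correct": "steelblue", "misclassified": "firebrick"}

p = figure(title="Misclassified vs. correctly classified cases",
           x_axis_label="X1", y_axis_label="X13")

# one scatter layer per group, so that each group gets its own legend entry
for status, group in toy.groupby("status"):
    p.scatter("X1", "X13", source=ColumnDataSource(group),
              color=colors[status], legend_label=status, size=8)

# render a self-contained page that loads BokehJS from the CDN
html = file_html(p, CDN, "dashboard")
with open("dashboard.html", "w", encoding="utf-8") as f:
    f.write(html)
```

Opening `dashboard.html` in a browser shows the interactive plot without a running Python session.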

---
__Deadline__: December 15, 2020. Submit your package via Moodle.
