# 1.0 An end-to-end classification problem (Part I)



## 1.1 Dataset description



We'll be looking at individual income in the United States. The **data** comes from the **1994 census** and contains information on each individual's **marital status**, **age**, **type of work**, and more. The **target column**, the value we want to predict, is whether an individual earns at most 50k a year or more than **50k a year**.

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).
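
As a minimal sketch of what the binary target looks like in code, the snippet below maps the two income labels to 0 and 1. The toy values and the variable names `labels` and `target` are assumptions made for illustration; the notebook itself keeps the labels as strings.

```python
import pandas as pd

# Toy labels in the same format as the target column (values are illustrative).
# Some carry stray whitespace, which is assumed to be possible in raw files.
labels = pd.Series(["<=50K", ">50K", " <=50K", ">50K "], name="high_income")

# Strip whitespace, then map the two classes to 0 (at most 50k) and 1 (more than 50k).
target = labels.str.strip().map({"<=50K": 0, ">50K": 1})

print(target.tolist())
```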

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Load libraries

In [2]:
import wandb
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
import tempfile
import os

## 1.3 Get data & Exploratory Data Analysis (EDA)

### 1.3.1 Create the raw_data artifact

In [3]:
# columns used 
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 
           'sex','capital_gain', 'capital_loss', 'hours_per_week',
           'native_country','high_income']
# importing the dataset
income = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                   header=None,
                   names=columns)
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
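
One detail worth checking after loading: the raw `adult.data` file separates fields with a comma followed by a space, so text values read with the default settings keep a leading space. The sketch below uses two shortened, made-up rows in that layout (not the full file) to show the effect of `skipinitialspace=True`; the names `plain` and `clean` are illustrative.

```python
import io
import pandas as pd

# Two shortened rows in the raw file's layout: fields separated by ", ".
raw = "39, State-gov, 77516, <=50K\n50, Self-emp-not-inc, 83311, <=50K\n"
cols = ["age", "workclass", "fnlwgt", "high_income"]

# Default parsing keeps the space that follows each comma in text fields.
plain = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# skipinitialspace=True drops whitespace right after the delimiter.
clean = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                    skipinitialspace=True)

print(repr(plain.loc[0, "workclass"]))
print(repr(clean.loc[0, "workclass"]))
```

Without this option, comparisons such as `df.workclass == "Private"` can silently match nothing.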


In [4]:
income.to_csv("raw_data.csv",index=False)

In [18]:
# Log in to Wandb; the CLI prompts for the API key (never commit a key to a notebook)
!wandb login --relogin

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/ivanovitchsilva/.netrc


In [19]:
# Send raw_data.csv to Wandb, storing it as an artifact
!wandb artifact put \
      --name week_07_eda/raw_data.csv \
      --type raw_data \
      --description "The raw data from 1994 US Census" raw_data.csv

[34m[1mwandb[0m: Uploading file raw_data.csv to: "ivanovitchm/week_07_eda/raw_data.csv:latest" (raw_data)
[34m[1mwandb[0m: Currently logged in as: [33mivanovitchm[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: Tracking run with wandb version 0.12.9
[34m[1mwandb[0m: Syncing run [33mprime-jazz-1[0m
[34m[1mwandb[0m:  View project at [34m[4mhttps://wandb.ai/ivanovitchm/week_07_eda[0m
[34m[1mwandb[0m:  View run at [34m[4mhttps://wandb.ai/ivanovitchm/week_07_eda/runs/2dz6cwyr[0m
[34m[1mwandb[0m: Run data is saved locally in /Users/ivanovitchsilva/mlops/week_07/Example_01/wandb/run-20211227_221152-2dz6cwyr
[34m[1mwandb[0m: Run `wandb offline` to turn off syncing.

Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("ivanovitchm/week_07_eda/raw_data.csv:latest")


[34m[1mwandb[0m: Waiting for W&B process to finish, PID 61650... (success).
[34m[1mwandb[0m:                                           

### 1.3.2 Download raw_data artifact from Wandb

In [20]:
# save_code=True tracks changes to the notebook code and syncs them with Wandb
run = wandb.init(project="week_07_eda", save_code=True)

[34m[1mwandb[0m: Currently logged in as: [33mivanovitchm[0m (use `wandb login --relogin` to force relogin)


In [21]:
# download the latest version of artifact raw_data.csv
artifact = run.use_artifact("week_07_eda/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [22]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  high_income     32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [24]:
df.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


### 1.3.3 Pandas Profiling

In [26]:
!pip install ipywidgets==7.6.5

Collecting ipywidgets==7.6.5
  Using cached ipywidgets-7.6.5-py2.py3-none-any.whl (121 kB)
Collecting jupyterlab-widgets>=1.0.0
  Using cached jupyterlab_widgets-1.0.2-py3-none-any.whl (243 kB)
Collecting widgetsnbextension~=3.5.0
  Using cached widgetsnbextension-3.5.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: widgetsnbextension, jupyterlab-widgets, ipywidgets
Successfully installed ipywidgets-7.6.5 jupyterlab-widgets-1.0.2 widgetsnbextension-3.5.2


In [27]:
ProfileReport(df, title="Pandas Profiling Report", explorative=True)

ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html



In [None]:
# There are duplicated rows
df.duplicated().sum()

In [None]:
# Delete duplicated rows
df.drop_duplicates(inplace=True)
df.duplicated().sum()
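
To make the behaviour of `duplicated` and `drop_duplicates` concrete, here is a small sketch on a toy frame (the data and the names `toy`, `all_copies`, and `deduped` are illustrative, not taken from the census data). By default `duplicated()` flags only the later copies, while `keep=False` shows every row in a duplicate group; resetting the index afterwards is optional, but it avoids gaps in the row labels.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 39, 28],
    "sex": ["Male", "Male", "Male", "Female"],
})

# Rows that repeat an earlier row (the first occurrence is not counted).
n_dupes = int(toy.duplicated().sum())

# Every member of a duplicate group, useful for inspecting what repeats.
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and rebuild a contiguous 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(all_copies), len(deduped))
```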


### 1.3.5 Manual EDA

In [None]:
# how can the sex column help us?
pd.crosstab(df.high_income,df.sex,margins=True)

In [None]:
# income vs [sex & race]?
pd.crosstab(df.high_income,[df.sex,df.race])
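
Raw counts are hard to compare when the groups differ in size. A possible complement, sketched here on made-up data, is to pass `normalize="columns"` to `pd.crosstab` so that each column sums to 1 and the cells read as within-group shares. The frame `toy` and the name `rates` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female",
            "Female", "Female", "Female", "Male"],
    "high_income": [">50K", "<=50K", ">50K", "<=50K",
                    "<=50K", ">50K", "<=50K", "<=50K"],
})

# Share of each income class within each sex (every column sums to 1).
rates = pd.crosstab(toy["high_income"], toy["sex"], normalize="columns")

print(rates)
```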

In [None]:
%matplotlib inline

sns.catplot(x="sex", 
            hue="race", 
            col="high_income",
            data=df, kind="count",
            height=4, aspect=.7)
plt.show()

In [None]:
g = sns.catplot(x="sex", 
                hue="workclass", 
                col="high_income",
                data=df, kind="count",
                height=4, aspect=.7)

g.savefig("HighIncome_Sex_Workclass.png", dpi=100)

run.log(
        {
            "High_Income vs Sex vs Workclass": wandb.Image("HighIncome_Sex_Workclass.png")
        }
    )

In [None]:
df.isnull().sum()
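
Note that `isnull()` only detects real missing values. The Adult dataset marks unknown entries with a `?` placeholder, which pandas reads as an ordinary string. The sketch below, on a toy frame with assumed values, strips whitespace and converts the placeholder to NaN so that the count reflects it; `text_cols` and `cleaned` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov"],
    "occupation": [" ?", "Sales", " ?"],
    "age": [39, 50, 28],
})

text_cols = ["workclass", "occupation"]

# Strip surrounding whitespace, then blank out the "?" placeholder (mask -> NaN).
stripped = toy[text_cols].apply(lambda s: s.str.strip())
cleaned = toy.copy()
cleaned[text_cols] = stripped.mask(stripped == "?")

before = toy.isnull().sum()      # placeholders are invisible here
missing = cleaned.isnull().sum()  # placeholders now count as missing

print(missing)
```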

## 1.4 Train/Test Split

In [None]:
splits = {}
splits["train"], splits["test"] = train_test_split(df,
                                                   test_size=0.30,
                                                   random_state=41,
                                                   stratify=df["high_income"])
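
The `stratify` argument keeps the class proportions of the target roughly equal in both splits. A quick way to confirm this, sketched on a toy frame with an assumed 80/20 class balance, is to compare `value_counts(normalize=True)` on each side.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% in the positive class (illustrative only).
toy = pd.DataFrame({
    "age": range(100),
    "high_income": ["<=50K"] * 80 + [">50K"] * 20,
})

train, test = train_test_split(toy,
                               test_size=0.30,
                               random_state=41,
                               stratify=toy["high_income"])

# Class shares should match the full data in both splits.
train_share = train["high_income"].value_counts(normalize=True)
test_share = test["high_income"].value_counts(normalize=True)

print(train_share, test_share, sep="\n")
```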

In [None]:
# Save the artifacts. We use a temporary directory so we do not leave
# any trace behind

with tempfile.TemporaryDirectory() as tmp_dir:

    # Use split_df as the loop variable so the full dataframe df is not overwritten
    for split, split_df in splits.items():

        # Make the artifact name from the provided root plus the name of the split
        artifact_name = f"data_{split}.csv"

        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir, artifact_name)

        # Save then upload to W&B
        split_df.to_csv(temp_path,index=False)

        artifact = wandb.Artifact(
            name=artifact_name,
            type="raw_data",
            description=f"{split} split of dataset week_07_eda/raw_data.csv:latest",
        )
        artifact.add_file(temp_path)

        run.log_artifact(artifact)

        # This waits for the artifact to be uploaded to W&B. If you
        # do not add this, the temp directory might be removed before
        # W&B had a chance to upload the datasets, and the upload
        # might fail
        artifact.wait()

### 1.4.1 Download the train and test artifacts

In [None]:
# download the latest version of artifacts data_test.csv and data_train.csv
artifact_train = run.use_artifact("week_07_eda/data_train.csv:latest")
artifact_test = run.use_artifact("week_07_eda/data_test.csv:latest")

# create a dataframe from each artifact
df_train = pd.read_csv(artifact_train.file())
df_test  = pd.read_csv(artifact_test.file())

In [None]:
print("Train: {}".format(df_train.shape))
print("Test: {}".format(df_test.shape))

In [None]:
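# plot_diff is not imported anywhere in this notebook; this assumes it is
# the function from the dataprep package
from dataprep.eda import plot_diff

plot_diff([df_train,df_test])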
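
A pandas-only way to compare the two splits, in case a visual diff tool is not at hand, is to put the category shares of a column side by side. The helper `compare_shares` below is hypothetical (it is not part of the notebook or of any library) and is shown on toy data.

```python
import pandas as pd

def compare_shares(train, test, column):
    """Side-by-side category shares of `column` in two frames (hypothetical helper)."""
    shares = pd.concat(
        {
            "train": train[column].value_counts(normalize=True),
            "test": test[column].value_counts(normalize=True),
        },
        axis=1,
    ).fillna(0.0)  # a category absent from one split gets a share of 0
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares

toy_train = pd.DataFrame({"sex": ["Male", "Male", "Female", "Female"]})
toy_test = pd.DataFrame({"sex": ["Male", "Female", "Female", "Female"]})

result = compare_shares(toy_train, toy_test, "sex")
print(result)
```

Large values in `abs_diff` would suggest the splits differ noticeably on that column.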

In [None]:
run.finish()