# Choose Your ML Problem and Data

In this unit's lab you will implement a model to solve a machine learning problem of your choosing. You will first have to make some decisions. These include:

1. Choosing your data set
2. Identifying your problem type: is it a classification or regression problem?
3. Picking a prediction task 
4. Identifying your label and features
5. Selecting your model
6. Determining the data preparation and feature engineering needed to build a balanced modeling data set for your problem and model, such as:
    * creating binary variables
    * addressing missingness, such as replacing missing values with means
    * renaming features and labels
    * finding and replacing outliers
    * performing winsorization if needed
    * performing one-hot encoding on categorical features
    * performing vectorization for an NLP problem
    * addressing class imbalance in your data sample
7. Selecting appropriate techniques to evaluate your model's performance and improve your model
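
As a rough illustration of a few of the preparation techniques listed in item 6, the sketch below applies them to a small toy DataFrame. The data, the column names (`hours`, `income_high`, `hours_win`) and the 10th/90th percentile cutoffs are all invented for this example; they are assumptions, not part of the course data sets or a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 47.0, 51.0, 38.0],
    "hours": [40.0, 45.0, 38.0, 99.0, 40.0, 1.0],
    "workclass": ["Private", "State-gov", "Private", "Private", "Self-emp", "State-gov"],
    "income": ["<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Binary variable: 1 if income is ">50K", otherwise 0.
toy["income_high"] = (toy["income"] == ">50K").astype(int)

# Missingness: replace missing ages with the column mean.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Winsorization: clip values to the 10th and 90th percentiles (assumed cutoffs).
lower, upper = toy["hours"].quantile([0.10, 0.90])
toy["hours_win"] = toy["hours"].clip(lower=lower, upper=upper)

# One-hot encoding of a categorical feature.
prepared = pd.get_dummies(toy.drop(columns=["income"]), columns=["workclass"])
print(prepared.head())
```

Which of these steps you actually need depends on the data set, label and model you choose.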


Before you can begin to formulate your machine learning problem, you must select a data set and choose a predictive problem that this data set supports.

In this exercise you will choose your data set, and inspect, analyze and visualize the data with your predictive modeling problem and machine learning model in mind.

### Import Packages

Before you get started, import a few packages. You can import additional packages that you have used in this course that you may need for this task.

In [17]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1: Choose Your Data Set and Load the Data


You will have the option to choose one of three data sets that you have worked with in this program:

* The "adult" data set that contains Census information from 1994: `adultDataFull.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some have not had all of the preprocessing applied that specific models require.

#### Load the Data Set

The code cell below contains filenames (path + filename) for each of the three data sets available to you.

<b>Task:</b> In the code cell below, use `pd.read_csv()`, as you have been doing throughout this course, to load the data and save it to DataFrame `df`.

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [18]:
# Filenames of the three data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "adultDataFull.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")



df = pd.read_csv(adultDataSet_filename)

df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex_selfID | capital-gain | capital-loss | hours-per-week | native-country | income_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Non-Female | 2174 | 0 | 40.0 | United-States | <=50K |
| 1 | 50.0 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Non-Female | 0 | 0 | 13.0 | United-States | <=50K |
| 2 | 38.0 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Non-Female | 0 | 0 | 40.0 | United-States | <=50K |
| 3 | 53.0 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Non-Female | 0 | 0 | 40.0 | United-States | <=50K |
| 4 | 28.0 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40.0 | Cuba | <=50K |


## Step 2: Choose Your Predictive Problem and Label 

Now that you have chosen your data set, you can choose what you would like to predict, i.e. the label.

<b>Task:</b> Once you have chosen your label, use visualization techniques you have learned to plot the data distribution of the label column's values. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.


In [19]:
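# Label: the marital status of each person
y = df["marital-status"]
# Candidate features plus the label, kept together for inspection
myList = ["age","education","occupation","marital-status"]
myDf = df[myList]
myDf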



| | age | education | occupation | marital-status |
|---|---|---|---|---|
| 0 | 39.0 | Bachelors | Adm-clerical | Never-married |
| 1 | 50.0 | Bachelors | Exec-managerial | Married-civ-spouse |
| 2 | 38.0 | HS-grad | Handlers-cleaners | Divorced |
| 3 | 53.0 | 11th | Handlers-cleaners | Married-civ-spouse |
| 4 | 28.0 | Bachelors | Prof-specialty | Married-civ-spouse |
| ... | ... | ... | ... | ... |
| 32556 | 27.0 | Assoc-acdm | Tech-support | Married-civ-spouse |
| 32557 | 40.0 | HS-grad | Machine-op-inspct | Married-civ-spouse |
| 32558 | 58.0 | HS-grad | Adm-clerical | Widowed |
| 32559 | 22.0 | HS-grad | Adm-clerical | Never-married |


## Step 3: Inspect and Analyze the Data

The next step is to inspect and analyze your data with your machine learning problem in mind, and formulate a plan to prepare your data to create a modeling data set that is appropriate for your problem.

While you will draft your plan to build a modeling data set and implement your model in the written portion of the assignment, you will use this notebook to analyze your data and come up with a data preparation and feature engineering plan. <b>Note:</b> This notebook should be used for investigation and analysis. You will implement data preparation techniques to build your modeling data set in your lab assignment.

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. 
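
A minimal sketch of these inspection techniques on a toy DataFrame follows. The data is invented for illustration, and the IQR rule used here to flag outliers is an assumption standing in for a visual check; a box plot drawn with Matplotlib or Seaborn would surface the same points graphically.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from the course files.
toy = pd.DataFrame({
    "age": [39.0, 50.0, 38.0, 53.0, 28.0, np.nan],
    "hours-per-week": [40.0, 13.0, 40.0, 45.0, 38.0, 99.0],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "HS-grad"],
})

print(toy.dtypes)          # data type of each column
print(toy.describe())      # key statistics for the numeric columns
print(toy.isnull().sum())  # number of missing values per column

# Data filter: keep only the rows where age is greater than 40.
older = toy[toy["age"] > 40]

# IQR rule (an assumed heuristic): flag values more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1, q3 = toy["hours-per-week"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["hours-per-week"] < q1 - 1.5 * iqr) | (toy["hours-per-week"] > q3 + 1.5 * iqr)
print(toy[is_outlier])
```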

As you are investigating your data, consider the machine learning model you would like to build for this problem, and consider different data preparation and feature engineering techniques that you will need to use.


<b>Task</b>: 

Use the techniques you have learned in this course to inspect and analyze your data. You can add code cells to accomplish this by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.



In [20]:
myDf.isnull().sum()

age                162
education            0
occupation        1843
marital-status       0
dtype: int64

In [21]:
myDf["age"].mean()

38.58921571653446

In [22]:
myDf.shape

(32561, 4)

In [23]:
# Drop rows with a missing occupation or age. Reassigning the result, rather
# than calling drop(..., inplace=True) on a slice of df, avoids pandas'
# SettingWithCopyWarning.
myDf = myDf.dropna(subset=["occupation", "age"])
myDf.isnull().sum()


age               0
education         0
occupation        0
marital-status    0
dtype: int64

In [50]:
# Count how many examples fall into each class of the label
class_values = list(myDf["marital-status"].unique())
for class_val in class_values:
    num = (myDf["marital-status"] == class_val).sum()
    print(class_val , ":" , num)

Never-married : 9865
Married-civ-spouse : 14270
Divorced : 4235
Married-spouse-absent : 388
Separated : 950
Married-AF-spouse : 21
Widowed : 836
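
The counts above can be turned into class proportions and a bar chart to judge class imbalance. The sketch below re-enters the printed counts by hand so that it runs on its own; the variable names (`counts`, `proportions`, `imbalance_ratio`) are hypothetical, and the largest-to-smallest ratio is just one simple way to summarize imbalance.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({
    "Never-married": 9865,
    "Married-civ-spouse": 14270,
    "Divorced": 4235,
    "Married-spouse-absent": 388,
    "Separated": 950,
    "Married-AF-spouse": 21,
    "Widowed": 836,
}, name="marital-status").sort_values(ascending=False)

# Share of each class and the ratio of the largest to the smallest class.
proportions = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(4))
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")

# Bar chart of the label distribution.
ax = counts.plot(kind="bar")
ax.set_xlabel("marital-status")
ax.set_ylabel("Number of examples")
ax.set_title("Label distribution")
plt.tight_layout()
```

In the notebook itself, `myDf["marital-status"].value_counts()` should yield the same counts directly from the DataFrame. A largest class several hundred times the size of the smallest one indicates a strongly imbalanced label.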
