## OBJECTIVES

- We want to get a taste of data manipulation and a little bit of machine learning using Python.
- This will be just the tip of the iceberg, but it is a crucial step that will let us practice our skills.
- Let's work through a very simple example and make sure that we understand every step along the way.

<img src="data/images/ml.jpg" style="width:700px;" />

## OVERVIEW

<img src="data/images/adults.png" style="width:700px;"/>

- We are going to use a very popular dataset borrowed from one of the open-source Machine Learning repositories.
- The dataset is called 'Adult' and it contains profile information about individual adults.
- There are two files: one is used to 'train' a machine (build a model) and the other one is used for testing.
- In this dataset the dependent variable is 'target' (i.e. we focus on the column called 'target').
- This example is a binary classification problem: the 'target' column holds only two distinct values.
- __Objective:__ _teach a machine to predict whether the salary of a given person is above 50K or not_

In [1]:
import pandas as pd

# load the training and test data

train = pd.read_csv("data/ml1_train.csv")
test = pd.read_csv("data/ml1_test.csv")

In [2]:
#let's review our main dataset
train.info()

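# RESULT: the train data has 32561 rows and 15 columns.
# Of these 15 columns, 6 hold integers (int64) and the remaining 9 hold text (object dtype).
# Three columns (workclass, occupation, native.country) have fewer non-null values than rows, so they contain missing values.
# We can run the same check on the test data with test.info().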

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  31978 non-null  object
 14  target          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [3]:
# alternative way of checking the number of rows and columns

print(f"The train data has {train.shape[0]} rows and {train.shape[1]} columns")
print(f"The test data has {test.shape[0]} rows and {test.shape[1]} columns")


The train data has 32561 rows and 15 columns
The test data has 16281 rows and 15 columns


In [4]:
# see what the data looks like

train.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


In [None]:
# check if we have missing values in our data

na_train = train.shape[0] - train.dropna().shape[0]
print (f"{na_train} rows have missing values in the train data")

na_test = test.shape[0] - test.dropna().shape[0]
print (f"{na_test} rows have missing values in the test data")
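
The row count above comes from subtracting the rows that survive `dropna()` from the total. A minimal sketch of an equivalent check, run on a tiny made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the real Adult data):

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values, not the real Adult data)
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", None, "Private", None],
    "occupation": ["Adm-clerical", None, "Handlers-cleaners", "Sales"],
})

# Approach used in the cell above: total rows minus rows that survive dropna()
rows_with_na = toy.shape[0] - toy.dropna().shape[0]

# Equivalent: flag rows that contain at least one missing value, then count them
rows_with_na_alt = int(toy.isnull().any(axis=1).sum())

print(f"{rows_with_na} rows have missing values (dropna approach)")
print(f"{rows_with_na_alt} rows have missing values (isnull approach)")
```

Both approaches count a row once even if several of its columns are missing, which is why the row count can be smaller than the sum of the per-column counts.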

In [None]:
# Now let's check which columns have missing values
train.isnull().sum()

In [None]:
# Let's check our character columns - how many unique values does each column hold

categories = train.select_dtypes(include=['object'])
categories.apply(pd.Series.nunique)

#### Dealing with missing values

 - We identified three columns with missing values: workclass, occupation and native.country.
 - Dropping every affected row would throw away otherwise useful records, and we don't want to lose them!
 - Instead, let's impute these missing values with each column's mode (its most frequent value).
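
As a rough illustration of mode imputation, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only; the idea is that the fill value can be looked up with `mode()` rather than typed by hand.

```python
import pandas as pd

# Hypothetical miniature frame with gaps in two text columns
toy = pd.DataFrame({
    "workclass": ["Private", "Private", None, "State-gov"],
    "occupation": ["Sales", None, "Sales", "Adm-clerical"],
})

for col in ["workclass", "occupation"]:
    # mode() ignores missing values by default; [0] takes the first (most frequent) value
    most_common = toy[col].mode()[0]
    # assign the filled column back to the frame
    toy[col] = toy[col].fillna(most_common)

print(toy)
print(toy.isnull().sum())
```

If two values are tied for most frequent, `mode()` returns both and `[0]` picks the first in sorted order, so it is worth checking `value_counts()` before relying on it.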
 
 

In [None]:
train.workclass.value_counts(sort=True)

In [None]:
train.occupation.value_counts(sort=True)

In [None]:
train['native.country'].value_counts(sort=True)

In [None]:
# Replace missing values with the most frequent value (mode) of each column
# (see the value_counts() output above, or check your .csv files, to see where these come from).
# We assign the result back to the column: calling fillna(..., inplace=True) on a
# selected column raises a warning in pandas 2.x and may not update the DataFrame
# in newer versions.

# Workclass
train['workclass'] = train['workclass'].str.strip().fillna('Private')

# Occupation
train['occupation'] = train['occupation'].str.strip().fillna('Prof-specialty')

# Native Country
train['native.country'] = train['native.country'].str.strip().fillna('United-States')

In [None]:
# Check no missing values are left in the dataset
train.isnull().sum()

In [None]:
# Let's check the 'target' variable (values in the 'target' column) to investigate if this data is imbalanced or not.

# check proportions % of target variable (there are only 2 unique values, 
# so we should get a % split of these values in our dataset)
train.target.value_counts()/train.shape[0]

<div class="alert alert-block alert-info">
<b>About 75% of the dataset belongs to the &lt;=50K class. This means that even if we always predict &lt;=50K without looking at any other column, we'll get roughly 75% accuracy.</b>
Let's create a cross tab of the target variable with education. With this, we'll try to understand the influence of education on the target variable.
</div>
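
The "rough guess" idea can be sketched as a majority-class baseline. The labels below are made up to mirror a 75% / 25% split (an assumption for illustration, not the real column):

```python
import pandas as pd

# Hypothetical labels with a 3:1 split, mirroring the balance described above
target = pd.Series(["<=50K"] * 75 + [">50K"] * 25)

# The most frequent class is what a "always guess the same thing" model predicts
majority_class = target.value_counts().idxmax()

# Accuracy of that guess = share of rows that actually belong to the majority class
baseline_accuracy = (target == majority_class).mean()

print(f"Always predicting {majority_class} gives {baseline_accuracy:.0%} accuracy")
```

Any model we train later should beat this number; otherwise it has learned nothing beyond the class balance.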

In [None]:
# Multiplying by 100 to make the percentages easier to read
pd.crosstab(train.education, train.target,margins=True)/train.shape[0]*100



### FINDINGS: 

__Among the roughly 75% of people with a <=50K salary:__ 

- High-school graduates with this salary make up 27% of the whole dataset, which matches the expectation that people with lower levels of education tend to earn less. 

__Among the roughly 25% of people with a >50K salary:__ 

- Bachelors with this salary make up 6% of the whole dataset and high-school graduates 5%. ==> _This pattern is surprising: high-school graduates are almost as common as bachelors among the higher earners. We have to consider more variables before coming to a conclusion._
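
Dividing the cross tab by the total row count gives each cell as a share of the whole dataset. To read shares within each salary class instead, `pd.crosstab` accepts a `normalize` argument. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical miniature dataset
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "HS-grad", "Bachelors", "Bachelors", "Masters"],
    "target": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"],
})

# Each cell as a share of ALL rows (same idea as dividing by the row count)
share_of_all = pd.crosstab(toy["education"], toy["target"], normalize="all") * 100

# Each cell as a share of its own salary class: every column sums to 100
share_within_class = pd.crosstab(toy["education"], toy["target"], normalize="columns") * 100

print(share_of_all.round(1))
print(share_within_class.round(1))
```

The second table answers "what fraction of the people in this salary class have this education level", which is often the easier figure to compare across classes of different sizes.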




## SCIKIT

- Let's try to utilise the mighty SciKit library for the next step. 
- https://scikit-learn.org/stable/

- __IMPORTANT: Scikit-learn expects data in numeric format! It means that we need to convert the character variables into numeric ones.__  


- To do our conversion, we will use the _LabelEncoder_ class.

- __IMPORTANT: In label encoding, each unique value of a variable gets assigned a number.__ Example: a variable fruit has four values: 'apple', 'banana', 'kiwi', 'melon'. LabelEncoder numbers the unique values in sorted order, so label encoding this variable returns: apple = 0, banana = 1, kiwi = 2, melon = 3.
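
A minimal sketch of label encoding on the fruit example. `LabelEncoder` stores the unique values in sorted order in its `classes_` attribute, and the position of each value there is its code. The `fruit` list is made up for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column of fruit names, deliberately unsorted and with a repeat
fruit = ["melon", "apple", "kiwi", "banana", "apple"]

encoder = LabelEncoder()
codes = encoder.fit_transform(fruit)  # learn the unique values, then encode

# classes_ holds the unique values in sorted order; the index is the assigned number
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)         # {'apple': 0, 'banana': 1, 'kiwi': 2, 'melon': 3}
print(codes.tolist())  # [3, 0, 2, 1, 0]

# inverse_transform turns the numbers back into the original labels
print(list(encoder.inverse_transform(codes)))
```

Keep in mind that these numbers are just identifiers: the encoding does not mean that melon is "greater than" apple in any meaningful sense.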

In [None]:
# let's encode all object type variables

from sklearn import preprocessing

for x in train.columns:
    if train[x].dtype == 'object':
        lbl = preprocessing.LabelEncoder()  # one encoder per column
        lbl.fit(list(train[x].values))  # learn the unique values of this column
        train[x] = lbl.transform(list(train[x].values))  # replace each value with its number
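
The loop above encodes the training data only. If the test file is to be used later, each category needs the same number in both files. One possible way to do that, sketched on made-up frames, is to keep the fitted encoders and reuse them. The names `toy_train`, `toy_test`, `categorical_cols` and `encoders` are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature train/test frames standing in for the real files
toy_train = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 50, 38]})
toy_test = pd.DataFrame({"sex": ["Female", "Male"], "age": [28, 37]})

categorical_cols = ["sex"]  # text columns to encode (listed by hand in this sketch)

encoders = {}  # column name -> fitted LabelEncoder
for col in categorical_cols:
    enc = LabelEncoder()
    toy_train[col] = enc.fit_transform(toy_train[col])  # fit on train only
    encoders[col] = enc

# Reuse the fitted encoders so each category maps to the same number in both frames
for col, enc in encoders.items():
    toy_test[col] = enc.transform(toy_test[col])

print(toy_train)
print(toy_test)
```

Note that `transform` raises an error if the test data contains a category that never appeared in the training data, so missing values and stray labels should be cleaned up first.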


In [None]:
# examine the dataset again to see how our encoding changes have been applied

train.head(20)

### CLOSING REMARKS

1. This was our first step towards familiarising ourselves with the world of Machine Learning.
1. Hopefully you found it interesting!
1. Next time we will cover more individual examples of ML processing and the SciKit library. 
1. Remember, it takes months and sometimes years to build, fit and validate solid ML models. We won't be able to complete the full cycle, but we will focus on key concepts, preprocessing and visualisation ¯\_(ツ)_/¯

<img src="data/images/remarks.jpeg" style="width:200px;" />