### Task: Structured Tabular Data

#### Dataset Link:
The dataset can be found at "/data/structured_data/data.csv" in the respective challenge's repo.

#### Description:
Tabular data is usually provided in CSV (comma-separated values) format. CSV files can be read and manipulated in Python using the pandas and NumPy libraries. The most common data types in structured data are numerical and categorical. Data processing is required to handle missing values, inconsistent string formats, missing commas, categorical variables, and the other kinds of data inadequacies that you will encounter in this course.

#### Objective:
Learn how to process and manipulate basic structured data for machine learning (check the Helpful Links section for hints).

#### Tasks:
- Load the CSV file (`pandas.read_csv` function)
- Classify columns into two groups: numerical and categorical. Print the column names for each group.
- Print the first 10 rows after handling missing values
- One-hot encode the categorical data
- Standardize or normalize the numerical columns
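
One possible way to approach the column-classification task is to split on dtype. The sketch below is only an illustration: the frame `sample` is a hypothetical miniature of the challenge data, and the names `numerical_cols` and `categorical_cols` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature frame mirroring some columns of the challenge data.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, None],
    "Salary": [72000.0, None, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# Numeric dtypes (int, float) are treated as numerical; everything else as categorical.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split is only a first pass: a column that stores category codes as integers would land in the numerical group and may need to be moved by hand.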

#### Ask yourself:

- Why do we need feature encoding and scaling techniques?
- What is ordinal data, and should we one-hot encode it? Are there better ways to encode it?
- What's the difference between normalization and standardization? Which technique is more suitable for this sample dataset?
- Can you solve the level-up challenge: complete all the above tasks without using the scikit-learn library?
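
As a starting point for the normalization-versus-standardization question and the level-up challenge, here is a minimal NumPy-only sketch. The helper names `min_max_normalize` and `standardize` are hypothetical, and the ages are simply the first five values from the sample data.

```python
import numpy as np

# Ages from the first five rows of the sample data; any 1-D numeric array works.
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0])

def min_max_normalize(x):
    """Rescale values linearly into the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    return (x - x.mean()) / x.std()

normalized = min_max_normalize(ages)
standardized = standardize(ages)

print(normalized)
print(standardized)
```

Normalization pins the smallest value to 0 and the largest to 1, so it is sensitive to outliers. Standardization leaves the range unbounded but centers the data at 0 with unit spread.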

#### Helpful Links:
- Nice introduction to handling missing values: https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
- Scikit-learn documentation for one-hot encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- Difference between normalization and standardization: https://medium.com/towards-artificial-intelligence/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

In [1]:
# Import the required libraries
# Use terminal commands like "pip install numpy" to install packages
import numpy as np
import pandas as pd
# import sklearn if and when required

In [2]:
!wget https://raw.githubusercontent.com/DeepConnectAI/challenge-week-1/master/data/structured_data/data.csv

--2020-08-17 13:02:15--  https://raw.githubusercontent.com/DeepConnectAI/challenge-week-1/master/data/structured_data/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 473 [text/plain]
Saving to: ‘data.csv’


2020-08-17 13:02:15 (21.8 MB/s) - ‘data.csv’ saved [473/473]



In [32]:
df = pd.read_csv('data.csv')

In [33]:
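# Preview the first 10 rows with missing values forward-filled.
# The result is not assigned, so df itself still contains the NaNs.
df.ffill().head(10)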

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,France,44.0,72000.0,No,1
1,Spain,27.0,48000.0,Yes,1
2,Germany,30.0,54000.0,No,2
3,Spain,38.0,61000.0,No,3
4,Germany,40.0,61000.0,Yes,1
5,France,35.0,58000.0,Yes,2
6,Spain,35.0,52000.0,No,3
7,France,48.0,79000.0,Yes,1
8,Germany,50.0,83000.0,No,2
9,France,37.0,67000.0,Yes,2


In [34]:
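# Column groups, assigned by hand
numerical = ['Age', 'Salary', 'Price Category Of Purchase']
categorical = ['Country', 'Purchased']
c = df[categorical]  # categorical columns
n = df[numerical]    # numerical columns
# Flag the missing entries
df.isna()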

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,True,False,False
5,False,False,False,False,False
6,False,True,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False


In [35]:
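# Fill the missing entries with 0 and assign the result back,
# rather than relying on in-place modification of a selected column.
df["Age"] = df["Age"].fillna(0)
df["Salary"] = df["Salary"].fillna(0)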

In [36]:
df.isna()

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False
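
Filling with 0 removes the NaNs but places a value far outside the observed range of ages and salaries. One common alternative is to fill each gap with the column median. The sketch below assumes a small hypothetical frame (`sample`), not the challenge file itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column.
sample = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# DataFrame.median() skips NaN by default, so each gap receives
# the median of the observed values in its own column.
filled = sample.fillna(sample.median())
print(filled)
```

Whether the median, the mean, or a forward fill is the better choice depends on the data; this is a sketch of one option, not a recommendation for this dataset.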


In [37]:
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import LabelEncoder

In [38]:
# LabelEncoder maps each category to an integer code (this is not one-hot encoding)
le = LabelEncoder()

df['Country'] = le.fit_transform(df['Country'])
df['Purchased'] = le.fit_transform(df['Purchased'])

In [39]:
df

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,0,44.0,72000.0,0,1
1,2,27.0,48000.0,1,1
2,1,30.0,54000.0,0,2
3,2,38.0,61000.0,0,3
4,1,40.0,0.0,1,1
5,0,35.0,58000.0,1,2
6,2,0.0,52000.0,0,3
7,0,48.0,79000.0,1,1
8,1,50.0,83000.0,0,2
9,0,37.0,67000.0,1,2
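
The cells above produce integer label codes, whereas the task list asks for one-hot encoding. A minimal sketch of one-hot encoding with `pandas.get_dummies`, which also works without scikit-learn, might look like this. The frame `sample` is hypothetical and only reuses the category labels from the dataset.

```python
import pandas as pd

# Hypothetical rows using the country and purchase labels from the dataset.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "No"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=["Country", "Purchased"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Unlike integer codes, the indicator columns do not imply any ordering between France, Germany, and Spain.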


In [40]:
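# Min-max scaling of the numerical columns, with the minimum and maximum taken from n
normalized_df = (df[numerical] - n.min()) / (n.max() - n.min())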

In [41]:
df[numerical] = normalized_df[numerical].values

In [42]:
df

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,0,0.8125,0.738095,0,0.0
1,2,0.28125,0.166667,1,0.0
2,1,0.375,0.309524,0,0.5
3,2,0.625,0.47619,0,1.0
4,1,0.6875,-0.97619,1,0.0
5,0,0.53125,0.404762,1,0.5
6,2,-0.5625,0.261905,0,1.0
7,0,0.9375,0.904762,1,0.0
8,1,1.0,1.0,0,0.5
9,0,0.59375,0.619048,1,0.5
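
In the output above, the two entries that were filled with 0 come out negative, which suggests the minimum and maximum used for scaling were not computed from the values actually being scaled. A minimal sketch, assuming a small hypothetical frame (`sample`) with label-encoded `Country`, that takes both statistics from the same columns it rescales:

```python
import pandas as pd

# Hypothetical frame: Country is already label-encoded, Age and Salary have no gaps.
sample = pd.DataFrame({
    "Country": [0, 2, 1],
    "Age": [44.0, 27.0, 50.0],
    "Salary": [72000.0, 48000.0, 83000.0],
})
numerical = ["Age", "Salary"]

# Compute min and max from the current values of the columns being scaled,
# so every result is guaranteed to fall in [0, 1].
col_min = sample[numerical].min()
col_max = sample[numerical].max()
sample[numerical] = (sample[numerical] - col_min) / (col_max - col_min)

print(sample)
```

Only the listed numerical columns are rescaled; the encoded `Country` column is left untouched.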
