### Task: Structured Tabular Data

#### Dataset Link:
The dataset can be found at "/data/structured_data/data.csv" in the respective challenge's repo.

#### Description:
Tabular data is usually stored in CSV (comma-separated values) format. CSV files can be read and manipulated in Python using the pandas and NumPy libraries. The most common data types in structured data are numerical and categorical. Data processing is required to handle missing values, inconsistent string formats, missing commas, categorical variables, and the other kinds of data problems that you will encounter in this course.

#### Objective:
Learn how to process and manipulate basic structured data for machine learning (see the Helpful Links section for hints).

#### Tasks:
- Load the CSV file (`pandas.read_csv` function)
- Classify the columns into two groups, numerical and categorical, and print the column names for each group
- Print the first 10 rows after handling missing values
- One-hot encode the categorical data
- Standardize or normalize the numerical columns

#### Ask yourself:

- Why do we need feature encoding and scaling techniques?
- What is ordinal data, and should we one-hot encode it? Are there any better ways to encode it?
- What's the difference between normalization and standardization? Which technique is most suitable for this sample dataset?
- Can you solve the level-up challenge: complete all the above tasks without using the scikit-learn library?
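
One of the questions above concerns ordinal data. As a minimal sketch, assuming a hypothetical `size` column with the ordered levels low < medium < high (not part of the sample dataset), an ordered pandas `Categorical` maps each level to an integer code that preserves the order, which one-hot encoding would discard:

```python
import pandas as pd

# Hypothetical ordinal column; the level names and their order are assumptions.
df = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# Declare the order explicitly so the integer codes follow it.
levels = ["low", "medium", "high"]
df["size_code"] = pd.Categorical(df["size"], categories=levels, ordered=True).codes

print(df)
```

Without the explicit `categories` list, the levels would be sorted alphabetically, which rarely matches the intended order.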

#### Helpful Links:
- A nice introduction to handling missing values: https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
- Scikit-learn documentation for one hot encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- Difference between normalization and standardization: https://medium.com/towards-artificial-intelligence/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

In [54]:
# Import the required libraries
# Use terminal commands like "pip install numpy" to install packages
import numpy as np
import pandas as pd
# import sklearn if and when required

In [55]:
data = pd.read_csv("data/structured_data/data.csv")

In [56]:
data.head()

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,France,44.0,72000.0,No,1
1,Spain,27.0,48000.0,Yes,1
2,Germany,30.0,54000.0,No,2
3,Spain,38.0,61000.0,No,3
4,Germany,40.0,,Yes,1


In [57]:
data.dtypes

Country                        object
Age                           float64
Salary                        float64
Purchased                      object
Price Category Of Purchase      int64
dtype: object

In [58]:
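# Group column names by dtype: int64, float64, and everything else (object/string)
integer = []
string = []
fl = []
for i in data:  # iterating over a DataFrame yields its column names
    if data[i].dtype == 'int64':
        integer.append(i)
    elif data[i].dtype == 'float64':
        fl.append(i)
    else:
        string.append(i)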

In [59]:
print("Integer Type :- ", integer)
print("Float Type :- ", fl)
print("String Type :- ", string)

Integer Type :-  ['Price Category Of Purchase']
Float Type :-  ['Age', 'Salary']
String Type :-  ['Country', 'Purchased']
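
The same grouping can be sketched with `DataFrame.select_dtypes`, which selects columns by dtype instead of looping over them. The small frame below is a hypothetical stand-in for the CSV, built from the first rows shown earlier:

```python
import pandas as pd

# Hypothetical stand-in for data.csv, built from the first rows shown earlier.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
    "Price Category Of Purchase": [1, 1, 2],
})

# Numeric dtypes (int and float) go in one group, everything else in the other.
numerical = sample.select_dtypes(include="number").columns.tolist()
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical)
print("Categorical:", categorical)
```

Grouping by dtype alone places `Price Category Of Purchase` with the numerical columns, although its name suggests a category; whether to treat it as ordinal is a modelling decision.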


In [60]:
data.isnull().sum()

Country                       0
Age                           2
Salary                        3
Purchased                     0
Price Category Of Purchase    0
dtype: int64

In [61]:
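# Fill missing ages with the column mean, truncated to a whole number
data['Age'] = data['Age'].fillna(int(data['Age'].mean()))

In [62]:
# Fill missing salaries with the column mean
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())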

In [63]:
data.head(10)

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,France,44.0,72000.0,No,1
1,Spain,27.0,48000.0,Yes,1
2,Germany,30.0,54000.0,No,2
3,Spain,38.0,61000.0,No,3
4,Germany,40.0,60364.705882,Yes,1
5,France,35.0,58000.0,Yes,2
6,Spain,34.0,52000.0,No,3
7,France,48.0,79000.0,Yes,1
8,Germany,50.0,83000.0,No,2
9,France,37.0,67000.0,Yes,2
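
Mean imputation, as used above, is one option; the median is a common alternative that is less sensitive to outliers. A minimal sketch with `Series.fillna` on a hypothetical salary column:

```python
import numpy as np
import pandas as pd

# Hypothetical salary values with one missing entry.
salary = pd.Series([72000.0, 48000.0, np.nan, 61000.0, 54000.0])

# mean() and median() skip NaN by default, so they use the four known values.
mean_filled = salary.fillna(salary.mean())
median_filled = salary.fillna(salary.median())

print(mean_filled[2], median_filled[2])
```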


In [64]:
from sklearn.preprocessing import OneHotEncoder

In [65]:
# This encoder is created but not used in the cells below, which label-encode instead
enc = OneHotEncoder(handle_unknown='ignore')

In [66]:
from sklearn import preprocessing

In [67]:
le = preprocessing.LabelEncoder()

In [68]:
le.fit(data.Country)

LabelEncoder()

In [69]:
le.classes_

array(['France', 'Germany', 'Spain'], dtype=object)

In [70]:
data.Country=le.transform(data.Country)

In [71]:
data.head()

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,0,44.0,72000.0,No,1
1,2,27.0,48000.0,Yes,1
2,1,30.0,54000.0,No,2
3,2,38.0,61000.0,No,3
4,1,40.0,60364.705882,Yes,1


In [72]:
le.fit(data.Purchased)

LabelEncoder()

In [73]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [74]:
data.Purchased=le.transform(data.Purchased)

In [75]:
data.head()

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,0,44.0,72000.0,0,1
1,2,27.0,48000.0,1,1
2,1,30.0,54000.0,0,2
3,2,38.0,61000.0,0,3
4,1,40.0,60364.705882,1,1
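
The cells above replace each country with an integer code, which is label encoding rather than one-hot encoding; the codes 0, 1, 2 imply an order among countries that does not exist. A minimal sketch of one-hot encoding with `pandas.get_dummies`, on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical stand-in for the categorical columns of the dataset.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Purchased": ["No", "Yes", "No"],
})

# One indicator column per country; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=["Country"], dtype=int)

print(encoded)
```

Each row has exactly one `Country_*` column set to 1, so no country is ranked above another.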


In [76]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() 


In [77]:
data

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,0,44.0,72000.0,0,1
1,2,27.0,48000.0,1,1
2,1,30.0,54000.0,0,2
3,2,38.0,61000.0,0,3
4,1,40.0,60364.705882,1,1
5,0,35.0,58000.0,1,2
6,2,34.0,52000.0,0,3
7,0,48.0,79000.0,1,1
8,1,50.0,83000.0,0,2
9,0,37.0,67000.0,1,2


In [78]:
# Select the column as a one-column DataFrame: StandardScaler expects 2D input,
# and a second positional argument would be taken as y and ignored.
data_scaled_Age = scaler.fit_transform(data[['Age']])

In [79]:
data_scaled_Salary = scaler.fit_transform(data[['Salary']])

In [80]:
data_scaled_Price = scaler.fit_transform(data[['Price Category Of Purchase']])

In [81]:
# Each result has shape (n_rows, 1); ravel() flattens it for column assignment
data["Age"]=data_scaled_Age.ravel()

In [82]:
data["Salary"]=data_scaled_Salary.ravel()

In [83]:
data["Price Category Of Purchase"]=data_scaled_Price.ravel()

In [84]:
# Each scaled column holds its own z-scores; Country and Purchased keep their integer codes
data
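
For the level-up challenge, both scaling techniques can be written with pandas alone. In the sketch below (hypothetical `age` values), standardization subtracts the mean and divides by the standard deviation, while min-max normalization rescales the values to the range [0, 1]. `ddof=0` is passed because pandas defaults to the sample standard deviation, whereas scikit-learn's `StandardScaler` uses the population one:

```python
import pandas as pd

# Hypothetical ages; any numerical column can be treated the same way.
age = pd.Series([44.0, 27.0, 30.0, 38.0, 40.0])

# Standardization (z-score): result has mean 0 and standard deviation 1.
standardized = (age - age.mean()) / age.std(ddof=0)

# Min-max normalization: smallest value maps to 0, largest to 1.
normalized = (age - age.min()) / (age.max() - age.min())

print(standardized)
print(normalized)
```

Standardization is usually preferred when a column has no natural bounds, and min-max normalization when a fixed range is needed; which suits this dataset is left as the exercise asks.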
