### Task: Structured Tabular Data

#### Dataset Link:
The dataset can be found at "/data/structured_data/data.csv" in the respective challenge's repo.

#### Description:
Tabular data is usually given in CSV (comma-separated values) format. CSV files can be read and manipulated using the pandas and NumPy libraries in Python. The most common data types in structured data are numerical and categorical. Data processing is required to handle missing values, inconsistent string formats, missing commas, categorical variables, and the other kinds of data inadequacies that you will encounter in this course.
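
As a rough illustration of the "inconsistent string formats" mentioned above, the sketch below normalizes whitespace and capitalization in text columns. The values are made up for the example and are not taken from data.csv.

```python
import pandas as pd

# Hypothetical raw values; the real data.csv may look different.
raw = pd.DataFrame({
    "Country": [" France", "spain ", "GERMANY", None],
    "Purchased": ["yes", "No ", "YES", "no"],
})

clean = raw.copy()
for col in ["Country", "Purchased"]:
    # Strip surrounding whitespace and use one capitalization style.
    # Missing entries stay missing; they are handled in a later step.
    clean[col] = clean[col].str.strip().str.title()

print(clean)
```

Cleaning strings before encoding matters because "yes", "YES" and "Yes " would otherwise be treated as three different categories.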

#### Objective:
Learn how to process and manipulate basic structured data for machine learning (check the Helpful Links section for hints).

#### Tasks:
- Load the csv file (pandas.read_csv function)
- Classify columns into two groups - numerical and categorical. Print column names for each group.
- Print first 10 rows after handling missing values
- One-Hot encode the categorical data
- Standardize or normalize the numerical columns

#### Ask yourself:

- Why do we need feature encoding and scaling techniques?
- What is ordinal data, and should we one-hot encode it? Are there better ways to encode it?
- What's the difference between normalization and standardization? Which technique is most suitable for this sample dataset?
- Can you solve the level-up challenge: complete all of the above tasks without using the scikit-learn library?
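
For the ordinal-data question, one possible alternative to one-hot encoding is to map each category to an integer that respects its rank. The sketch below uses only pandas, in the spirit of the level-up challenge. The category names and their ordering are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ordinal column; the category names are made up for illustration.
levels = pd.Series(["low", "high", "medium", "low"])
order = ["low", "medium", "high"]  # assumed ranking, lowest first

# An ordered Categorical keeps the ranking; .codes gives 0, 1, 2 in that order.
encoded = pd.Categorical(levels, categories=order, ordered=True).codes
print(list(encoded))
```

Unlike one-hot encoding, this keeps the information that "low" comes before "medium", which some models can exploit.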

#### Helpful Links:
- Nice introduction to handling missing values: https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
- Scikit-learn documentation for one hot encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- Difference between normalization and standardization: https://medium.com/towards-artificial-intelligence/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

In [14]:
%matplotlib inline
import numpy as np
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

READ AND SHOW CSV FILE

In [4]:
df = pd.read_csv("/Users/regatte/Desktop/challenge-week-1/data/structured_data/data.csv")
df.describe()

,Age,Salary,Price Category Of Purchase
count,18.0,17.0,20.0
mean,34.222222,60364.705882,2.0
std,9.194343,11799.202366,0.858395
min,18.0,41000.0,1.0
25%,27.25,54000.0,1.0
50%,35.0,58800.0,2.0
75%,39.5,67000.0,3.0
max,50.0,83000.0,3.0


column names

In [6]:
for i in df.columns:
    print(i)

Country
Age
Salary
Purchased
Price Category Of Purchase
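
The task also asks for the columns to be split into numerical and categorical groups. A minimal sketch using `select_dtypes` follows. The frame is a small stand-in with the same column names as above, and the Country values are hypothetical.

```python
import pandas as pd

# Small stand-in frame with the same column names as the output above.
# The Country values are hypothetical; the other values are illustrative.
sample = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
    "Price Category Of Purchase": [1, 1, 2],
})

# Numeric dtypes go in one group, everything else in the other.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a dtype-based split places "Price Category Of Purchase" in the numerical group because it is stored as integers, even though it may be better treated as an ordinal category.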


handling missing data

In [9]:
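# Drop every row that has at least one missing value.
ndf = df.dropna(axis=0, how='any')
# Note: this summarizes the original df, so the counts below still reflect the missing values.
df.describe()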

,Age,Salary,Price Category Of Purchase
count,18.0,17.0,20.0
mean,34.222222,60364.705882,2.0
std,9.194343,11799.202366,0.858395
min,18.0,41000.0,1.0
25%,27.25,54000.0,1.0
50%,35.0,58800.0,2.0
75%,39.5,67000.0,3.0
max,50.0,83000.0,3.0
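
Dropping rows is the simplest option, but it discards data. As an alternative, here is a hedged sketch of simple imputation: numeric gaps are filled with the column mean and categorical gaps with the most frequent value. The toy frame is invented for the example, and whether mean or mode imputation suits the real dataset is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 40.0],
    "Salary": [72000.0, 48000.0, np.nan, 60000.0],
    "Purchased": ["No", "Yes", "No", None],
})

filled = toy.copy()
for col in ["Age", "Salary"]:
    # Replace missing numbers with the mean of the observed values.
    filled[col] = filled[col].fillna(filled[col].mean())

# Replace missing categories with the most frequent observed value.
filled["Purchased"] = filled["Purchased"].fillna(filled["Purchased"].mode()[0])

print(filled)
```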


first 10 rows after handling missing values

In [36]:
# Drop the Country column (pandas 2.x no longer accepts axis as a positional argument).
ndf = ndf.drop(columns='Country')
ndf.head(10)

,Age,Salary,Purchased,Price Category Of Purchase
0,44.0,72000.0,No,1
1,27.0,48000.0,Yes,1
2,30.0,54000.0,No,2
3,38.0,61000.0,No,3
5,35.0,58000.0,Yes,2
7,48.0,79000.0,Yes,1
8,50.0,83000.0,No,2
9,37.0,67000.0,Yes,2
10,18.0,54400.0,No,3
11,22.0,55000.0,Yes,3


ONE HOT ENCODE

In [39]:
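# Copy the cleaned frame into a NumPy object array.
x = ndf.iloc[:, :].values
# LabelEncoder maps the Purchased column (index 2) to integers: No -> 0, Yes -> 1.
# This is label encoding; a true one-hot encoding would create one column per category.
labelencoder_x = LabelEncoder()
x[:, 2] = labelencoder_x.fit_transform(x[:, 2])
y = pd.DataFrame(x)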

In [40]:
print(y)
# Yes and No are now represented as numbers (No = 0, Yes = 1)

     0      1  2  3
0   44  72000  0  1
1   27  48000  1  1
2   30  54000  0  2
3   38  61000  0  3
4   35  58000  1  2
5   48  79000  1  1
6   50  83000  0  2
7   37  67000  1  2
8   18  54400  0  3
9   22  55000  1  3
10  28  42000  0  3
11  24  41000  0  2
12  35  69000  0  1
13  32  67000  1  3
14  38  65000  1  3
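
The cell above produces a single integer column for Purchased. For comparison, here is a minimal sketch of one-hot encoding with `pandas.get_dummies`, which creates one 0/1 column per category and needs no scikit-learn. The Country values are hypothetical, since that column was dropped before this step.

```python
import pandas as pd

# Toy frame; the Country values are hypothetical.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# One new 0/1 column per category, named <column>_<category>.
one_hot = pd.get_dummies(toy, columns=["Country", "Purchased"], dtype=int)

print(one_hot)
```

Each row has exactly one 1 among the Country columns and one among the Purchased columns. For a binary column such as Purchased, the two one-hot columns are redundant, which is why a single 0/1 label is often enough.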


normalising and standardising

In [42]:
ndf.columns

Index(['Age', 'Salary', 'Purchased', 'Price Category Of Purchase'], dtype='object')

In [43]:
x_data = ndf[['Age', 'Salary']]
y_data = ndf[['Price Category Of Purchase']]

In [44]:
# Min-max normalization: rescale each column to the range [0, 1].
x_data = x_data.apply(lambda x: (x - x.min()) / (x.max() - x.min()))

In [46]:
x_data

,Age,Salary
0,0.8125,0.738095
1,0.28125,0.166667
2,0.375,0.309524
3,0.625,0.47619
5,0.53125,0.404762
7,0.9375,0.904762
8,1.0,1.0
9,0.59375,0.619048
10,0.0,0.319048
11,0.125,0.333333
