# Interactive exercise: Data analysis I and data-preprocessing

## 1. Data points, data features, and data types

The scikit-learn library comes with a few small standard datasets. One of these datasets is the *Boston house prices dataset*.

In this dataset, each **data point** describes the situation of a Boston suburb or town. The **input data** includes some **data attributes** related to the place, and the **output data** is the house price (median value of owner-occupied homes in the unit of 1000 USD) at the place.

The name of each attribute is abbreviated to a few capital letters to save space. The following list describes the abbreviations:

- CRIM: per capita crime rate by town

- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS: proportion of non-retail business acres per town

- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX: nitric oxides concentration (parts per 10 million)

- RM: average number of rooms per dwelling

- AGE: proportion of owner-occupied units built prior to 1940

- DIS: weighted distances to five Boston employment centres

- RAD: index of accessibility to radial highways

- TAX: full-value property-tax rate per $10,000

- PTRATIO: pupil-teacher ratio by town

- B: 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town

- LSTAT: % lower status of the population

You do not have to understand the specific meaning of each attribute. You only need to know that each of them represents some property of a place in Boston. Let's see what the dataset looks like!

First, let's look at the input data:

In [None]:
import numpy as np
import warnings  # only used to silence the deprecation warning raised by load_boston
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

# Load the dataset from the scikit-learn library
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    dataset = load_boston()

dataset_input, dataset_output, dataset_feature_names = dataset["data"], dataset["target"], dataset["feature_names"]

print("The type of the input data is:",dataset_input.dtype,"\n")

print("The input data is:\n", dataset_input,"\n")

print("The shape of the input data is:",dataset_input.shape)

*float64* is a data type from the numpy library that is analogous to the built-in Python data type *float*. For the purpose of this course, you can regard each element in the table (array), such as *6.3200e-03*, as an ordinary **float**. The *e* symbol denotes a power of ten. For example, *6.3200e-03* means ${6.32} \times {10}^{-3}$.
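As a small illustration of the two points above, the sketch below parses the scientific-notation string and checks that a numpy *float64* can be used wherever a Python *float* is expected. The variable name `value` is chosen only for this example.

```python
import numpy as np

# Parse the scientific-notation text shown in the table
value = np.float64("6.3200e-03")

# The "e-03" suffix shifts the decimal point three places to the left
print(value == 0.00632)

# numpy's float64 is a subclass of the built-in float
print(isinstance(value, float))
```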

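There are some dots ($\ldots$) inside the table. They do not belong to the data: when an array is large, numpy prints only the first and last few rows and columns and replaces the omitted elements with dots.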
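The dots come from numpy's print settings, which abbreviate large arrays. The sketch below uses a placeholder array of zeros with the same shape as the input data (not the real dataset) and compares the abbreviated text with the full text obtained by raising the `threshold` print option.

```python
import sys

import numpy as np

# Placeholder array with the same shape as the input data
table = np.zeros((506, 13))

# Default settings: large arrays are abbreviated with "..."
summary_text = np.array2string(table)

# Temporarily raise the threshold so that every element is printed
with np.printoptions(threshold=sys.maxsize):
    full_text = np.array2string(table)

print("..." in summary_text)
print("..." in full_text)
```

Printing the full table is rarely useful for data of this size, which is why indexing individual rows is usually the more practical way to inspect hidden elements.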

**Problem 1:**

> Each column of the table (each element along axis 1 of the array) represents a <u>____</u>. <br> A. Data type  B. Data attribute  C. Data point  D. Dataset

**Problem 2:**

> The shape of the array is (506,13), which corresponds to 
> - 506 <u>____</u> 
> - and 13 <u>____</u>. 
> <br> A. Data type  B. Data attribute  C. Data point  D. Dataset <br> 


The display of the table is simplified, so you cannot see most of its elements. You can view the hidden elements by specifying the row of the table (the index along axis 0 of the array). Because indices start at 0, the data point corresponding to the 236th location has index 235:



In [None]:
print(dataset_input[235])

**Problem 3:**
> What is the data point corresponding to the 457th location?

The data attributes (columns) are arranged in the following order:

In [None]:
print("The order of the attributes is: \n", dataset_feature_names)

For example:

- The 1st column of the table represents the attribute 'CRIM', which means per capita crime rate by town.
- The 2nd column of the table represents the attribute 'ZN', which means proportion of residential land zoned for lots over 25,000 sq.ft.
- The last column of the table represents the attribute 'LSTAT', which means % lower status of the population.

**Problem 4:**
> What does the 9th column of the table mean?

With this information, you can now understand what each element in the input data represents. For example, to get the proportion of owner-occupied units built prior to 1940 at the 236th location, simply retrieve the 7th element of the data point. The value is:

In [None]:
print(dataset_input[236-1][7-1])
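The same row and column indexing can be tried on a tiny array where every value is easy to check by hand. The array `table` below is made up for illustration and is not part of the dataset.

```python
import numpy as np

# A made-up table with 4 data points (rows) and 3 attributes (columns)
table = np.arange(12).reshape(4, 3)

# The 3rd data point has index 2 because counting starts at 0
third_point = table[3 - 1]

# The 2nd attribute of the 3rd data point; table[2][1] gives the same element
second_attribute_of_third_point = table[3 - 1, 2 - 1]

# The 2nd attribute of every data point (a whole column)
second_column = table[:, 2 - 1]

print(third_point)
print(second_attribute_of_third_point)
print(second_column)
```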

**Problem 5:**
> What is the index of accessibility to radial highways at the 154th location?

Now, let's move on to the output data.

In [None]:
print("The type of the output data is:",dataset_output.dtype,"\n")

print("The output data is: \n", dataset_output,"\n")

print("The shape of the output data is:", dataset_output.shape)

As with the input data, the data type is float64.

The shape of the array is (506,), which is analogous to a 1-dimensional list of 506 elements.

Each element of the list represents a data point.

For example, the house price (median value of owner-occupied homes in the unit of 1000 USD) at the 236th place is:

In [None]:
print(dataset_output[236-1])

**Problem 6:**
> What is the house price at the 347th place?

## 2. Statistical descriptions of data

For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data.

### 2.1. Central tendency 

To find the mean of each data attribute in the input data:



In [None]:
print(np.mean(dataset_input, axis=0))

Similarly, to find the median of each data attribute in the input data:

In [None]:
print(np.median(dataset_input, axis=0))
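To see what `axis=0` does and how the mean and the median can differ, here is a minimal sketch on a made-up array with three data points and two attributes. The second attribute contains one unusually large value.

```python
import numpy as np

# Made-up data: 3 data points (rows), 2 attributes (columns)
small_table = np.array([[1.0, 10.0],
                        [2.0, 20.0],
                        [3.0, 300.0]])

# axis=0 aggregates over the rows, giving one value per attribute
column_means = np.mean(small_table, axis=0)
column_medians = np.median(small_table, axis=0)

# axis=1 would instead give one value per data point
row_means = np.mean(small_table, axis=1)

print(column_means)    # the large value 300 pulls the mean of the 2nd attribute up
print(column_medians)  # the median is not affected by it
print(row_means)
```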

### 2.2. Dispersion

To find the variance of each data attribute in the input data:

In [None]:
print(np.var(dataset_input, axis=0))

Similarly, to find the standard deviation of each data attribute in the input data:

In [None]:
print(np.std(dataset_input, axis=0))
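The relationship between the two measures can be checked on a made-up array: the standard deviation is the square root of the variance. By default `np.var` and `np.std` divide by the number of data points (`ddof=0`); passing `ddof=1` gives the sample variance instead.

```python
import numpy as np

# Made-up data: 3 data points (rows), 2 attributes (columns)
small_table = np.array([[1.0, 2.0],
                        [3.0, 2.0],
                        [5.0, 2.0]])

population_variance = np.var(small_table, axis=0)       # divides by n
standard_deviation = np.std(small_table, axis=0)        # square root of the variance
sample_variance = np.var(small_table, axis=0, ddof=1)   # divides by n - 1

print(population_variance)
print(standard_deviation)
print(sample_variance)
```

The second attribute is constant, so its variance and standard deviation are both zero.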

## 3. Encoding

In this section, you are going to learn and try some of the most commonly used encoding techniques. Kaggle competitions involve encoding a lot, so this is a good time to refresh some of the most common and effective encoding techniques currently in use.

First, import the dataset:

In [None]:
import pandas as pd

df_train=pd.read_csv('../input/cat-in-the-dat/train.csv')
df_test=pd.read_csv('../input/cat-in-the-dat/test.csv')

print('train data set has got {} rows and {} columns'.format(df_train.shape[0],df_train.shape[1]))
print('test data set has got {} rows and {} columns'.format(df_test.shape[0],df_test.shape[1]))

X=df_train.drop(['target'],axis=1)
y=df_train['target']
#X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=0.2)

df_train.head()

### 3.1. Label encoding

In this method, you replace every categorical value with a number. That is, each category is substituted by a number. For example, we could substitute 1 for Grandmaster, 2 for Master, 3 for Expert, and so on.

To implement this:

In [None]:
from sklearn.preprocessing import LabelEncoder


train=pd.DataFrame()
label=LabelEncoder()
for c in  X.columns:
    if(X[c].dtype=='object'):
        train[c]=label.fit_transform(X[c])
    else:
        train[c]=X[c]
        
train.head(3)    
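To make the idea concrete without loading the Kaggle data, the sketch below label-encodes a made-up list of titles using `np.unique`. This is a stand-in for illustration, not the cell above: it numbers the categories from 0 in sorted order, which is also how `LabelEncoder` assigns its codes.

```python
import numpy as np

# Made-up categorical column
titles = np.array(["Grandmaster", "Master", "Expert", "Master"])

# classes holds the sorted distinct categories,
# codes holds the position of each original value in classes
classes, codes = np.unique(titles, return_inverse=True)

print(classes)
print(codes)

# The original values can be recovered from the codes
decoded = classes[codes]
```

Note that the codes follow alphabetical order, not the ranking of the titles, so a model may read an ordering into them that is not meaningful.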

### 3.2. One hot encoding

The second method encodes each category as a one-hot vector (also called dummy variables). Each category value is turned into a binary vector of size |i| (the number of distinct values in category i), in which every entry is zero except the one corresponding to that value. Here is a small example:

![](https://miro.medium.com/max/878/1*WXpoiS7HXRC-uwJPYsy1Dg.png)

To implement this:

In [None]:
from sklearn.preprocessing import OneHotEncoder

one=OneHotEncoder()

one.fit(X)
train=one.transform(X)

print(train)

print('train data set has got {} rows and {} columns'.format(train.shape[0],train.shape[1]))
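The construction of the binary vectors can be sketched by hand on a made-up colour column. This uses plain numpy rather than `OneHotEncoder`, so it only illustrates the idea: each value is first mapped to an integer code, and the code then selects a row of an identity matrix.

```python
import numpy as np

# Made-up categorical column
colours = np.array(["red", "green", "blue", "green"])

# Map each value to an integer code (categories are sorted alphabetically)
categories, codes = np.unique(colours, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for category k
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is why one-hot encoding can produce very wide tables.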

### 3.3 Target encoding

Target-based encoding converts categorical variables to numbers using the target. In this method, we replace the categorical variable with a single new numerical variable: each category is replaced with the corresponding probability of the target (if the target is categorical) or the average of the target (if the target is numerical). The main drawbacks of this method are its dependence on the distribution of the target and its lower predictive power compared with the binary encoding method.

For example,
<table style="width : 20%">
    <tr>
    <th>Country</th>
    <th>Target</th>
    </tr>
    <tr>
    <td>India</td>
    <td>1</td>
    </tr>
    <tr>
    <td>China</td>
    <td>0</td>
    </tr>
    <tr>
    <td>India</td>
    <td>0</td>
    </tr>
    <tr>
    <td>China</td>
    <td>1</td>
    </tr>
    <tr>
    <td>India</td>
    <td>1</td>
    </tr>
</table>


Encoding for India = (number of true targets under the label India) / (total number of targets under the label India), which is 2/3 ≈ 0.67.

<table style="width : 20%">
    <tr>
    <th>Country</th>
    <th>Target</th>
    </tr>
    <tr>
    <td>India</td>
    <td>0.67</td>
    </tr>
    <tr>
    <td>China</td>
    <td>0.5</td>
    </tr>
</table>


To implement this:

In [None]:
X_target=df_train.copy()
X_target['day']=X_target['day'].astype('object')
X_target['month']=X_target['month'].astype('object')
for col in X_target.columns:
    if (X_target[col].dtype=='object'):
        target= dict ( X_target.groupby(col)['target'].agg('sum')/X_target.groupby(col)['target'].agg('count'))
        X_target[col]=X_target[col].replace(target).values

X_target.head(4)
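The Country example above can be reproduced in a few lines of plain Python. This is a minimal sketch with the five rows of the example table typed in by hand; the variable names are chosen only for this illustration.

```python
from collections import defaultdict

# The five rows of the example table
countries = ["India", "China", "India", "China", "India"]
targets = [1, 0, 0, 1, 1]

# Accumulate the sum and the count of the target for each category
target_sums = defaultdict(int)
target_counts = defaultdict(int)
for country, target in zip(countries, targets):
    target_sums[country] += target
    target_counts[country] += 1

# Encoding of a category = mean of the target within that category
encoding = {country: target_sums[country] / target_counts[country]
            for country in target_sums}

# Replace every value in the column with its encoding
encoded_column = [encoding[country] for country in countries]

print(encoding)
print(encoded_column)
```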

## 4. Data cleaning

In this section, you will learn two approaches to **dealing with missing values**. Then you'll compare the effectiveness of these approaches on a real-world dataset.

There are many ways data can end up with missing values. For example,
- A 2 bedroom house won't include a value for the size of a third bedroom.
- A survey respondent may choose not to share their income.

Most machine learning libraries (including scikit-learn) give an error if you try to build a model using data with missing values. So you'll need to choose one of the strategies below.

The code below defines a dataset for this section:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Select target
y = data.Price

# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

Then, let's define a metric to measure the quality of data cleaning: the mean absolute error (MAE) of a model trained on the cleaned data. A lower score represents better quality.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### 4.1 Deletion of data points

The simplest option is to drop columns with missing values. 

![tut2_approach1](https://i.imgur.com/Sax80za.png)

Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach.  As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!

For example, to drop the columns with missing values in the Melbourne housing dataset:

In [None]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
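The column-dropping step can be followed on a tiny made-up array where `np.nan` marks a missing value. This sketch uses numpy instead of pandas, so the names `train_demo` and `valid_demo` are illustrative only. As in the cell above, the columns to drop are chosen from the training data and the same columns are removed from the validation data.

```python
import numpy as np

# Made-up data: the 2nd column of the training data has a missing value
train_demo = np.array([[1.0, np.nan, 3.0],
                       [4.0, 5.0, 6.0]])
valid_demo = np.array([[7.0, 8.0, 9.0]])

# True for every column that contains at least one missing value
has_missing = np.isnan(train_demo).any(axis=0)

# Keep only the columns without missing values, in both subsets
reduced_train_demo = train_demo[:, ~has_missing]
reduced_valid_demo = valid_demo[:, ~has_missing]

print(has_missing)
print(reduced_train_demo)
print(reduced_valid_demo)
```

A single missing entry was enough to remove the whole second column, including the valid values 5.0 and 8.0.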

### 4.2 Data imputation

**Imputation** fills in the missing data points. Common methods include filling in the missing data points with

- Manually decided values
- A global constant, such as 0 or 1
- A measure of central tendency, such as the mean or the median

For instance, we can fill in the mean value along each column. 

![tut2_approach2](https://i.imgur.com/4BpnlPA.png)

The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.

For example, to fill in the missing data points in the Melbourne housing dataset with mean values:

In [None]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
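What mean imputation does can be sketched with numpy on a made-up array. The column means are computed from the training data only and then reused for the validation data, which mirrors calling `fit_transform` on the training set and `transform` on the validation set. The array names are illustrative.

```python
import numpy as np

# Made-up data with missing values marked as np.nan
train_demo = np.array([[1.0, np.nan],
                       [3.0, 4.0],
                       [np.nan, 8.0]])
valid_demo = np.array([[np.nan, 10.0]])

# Mean of each column, ignoring the missing values (training data only)
column_means = np.nanmean(train_demo, axis=0)

# Replace each missing value with the mean of its column
imputed_train_demo = np.where(np.isnan(train_demo), column_means, train_demo)
imputed_valid_demo = np.where(np.isnan(valid_demo), column_means, valid_demo)

print(column_means)
print(imputed_train_demo)
print(imputed_valid_demo)
```

Reusing the training means for the validation data keeps information from the validation set out of the preprocessing step.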

## 5. Balancing

An unbalanced dataset is one in which far more data points belong to one class of the output data than to the others. For example, output data that describes the gender of the students in a class can be quite balanced, but output data that describes fraud in transactions made by highly trusted agents can be highly unbalanced:

![bal vs unbal](https://miro.medium.com/max/1400/1*miAWYUJ7sgWaRHCZMdP2OQ.png)

Balancing an unbalanced dataset can improve a model's accuracy, particularly on the minority class. For example, in the output data of the breast cancer dataset below, 0 represents a malignant tumour and 1 represents a benign one:



In [None]:
from sklearn.datasets import load_breast_cancer

dataset_2=load_breast_cancer()

print(dataset_2.target)

To balance the dataset, first check the distribution of each class:

In [None]:
values,counts = np.unique(dataset_2.target, return_counts=True)

print(np.asarray((values, counts)).T)
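Once the class counts are known, one possible way to balance the data is random oversampling: data points of the smaller class are repeated at random until every class is as large as the biggest one. The sketch below is an assumption about how this step could look, not a method prescribed by this exercise, and it uses made-up labels rather than the breast cancer dataset.

```python
import numpy as np

# Made-up output data: 8 data points of class 0 and 2 of class 1
labels = np.array([0] * 8 + [1] * 2)

values, counts = np.unique(labels, return_counts=True)
majority_count = counts.max()

rng = np.random.default_rng(0)  # fixed seed so the result is repeatable
balanced_indices = []
for value, count in zip(values, counts):
    class_indices = np.flatnonzero(labels == value)
    # Draw extra indices (with replacement) until the class matches the majority
    extra_indices = rng.choice(class_indices, size=majority_count - count, replace=True)
    balanced_indices.append(np.concatenate([class_indices, extra_indices]))

balanced_indices = np.concatenate(balanced_indices)
balanced_labels = labels[balanced_indices]
balanced_counts = np.unique(balanced_labels, return_counts=True)[1]

print(counts)
print(balanced_counts)
```

The same `balanced_indices` would be used to select the rows of the input data so that inputs and outputs stay aligned.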