# **Data Analysis**
<font color='grey' size='1.5'> Created by Parisa Hosseinzadeh for *Machine learning for proteins*, Spring 2022. The examples are adapted from the [hands-on tutorial for chapter 2](https://colab.research.google.com/github/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb#scrollTo=yWpx5Wa71o58) of the [Hands-On Machine Learning book](https://www.amazon.ca/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646).</font>

In this simple activity, we will practice working with data.

## Setting up

Let's import all the necessary modules and libraries. Let's also mount Google Drive here.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

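# Scikit-Learn ≥0.20 is required
import sklearn
# compare the numeric (major, minor) parts; comparing raw strings
# misorders versions such as "0.9" and "0.20"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)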

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "W1L2"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Mounting google drive
google_drive_mount_point = '/content/google_drive'

import os, sys, time

if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount(google_drive_mount_point)

if not os.getenv("DEBUG"):
    google_drive = google_drive_mount_point + '/My Drive' 

Mounted at /content/google_drive


## Loading data

We will use the *housing.csv* data for this exercise.

In [None]:
# load data as csv
# Replace <Folder> with the name of your folder inside MyDrive, for example BioEML
housing = pd.read_csv('/content/google_drive/MyDrive/<Folder>/housing.csv')

In [None]:
# taking a look at the first few lines of the file
housing.head()

## Basic analysis/visualization

Let's perform some basic analysis and visualization on our dataset.

In [None]:
# Let's take a look at one of the columns
housing['longitude'].hist(bins=50, figsize=(20,15))
save_fig("longitude")
plt.show()

In [None]:
# calculating mean, median, standard deviation, and range
mean = housing['longitude'].mean()
std = housing['longitude'].std()
median = housing['longitude'].median()
# named value_range so Python's built-in range() is not overwritten
value_range = housing['longitude'].max() - housing['longitude'].min()
print('For the variable longitude,',
      '\n the mean is:', mean,
      '\n the median is:', median,
      '\n the standard deviation is:', std,
      '\n the range is:', value_range)

#### Time to practice

Plot the distribution and calculate the values listed above for *latitude*, *population*, and *total_rooms*, then submit them for your in-class activity.

In [None]:
#empty code cell for you

### Checking everything at once

In [None]:
# Looking at everything at once
housing.describe()

In [None]:
# Visualizing all plots
housing.hist(bins=50, figsize=(20,15))

In [None]:
# To look at the values of a non-numerical column, you can use the following
housing["ocean_proximity"].value_counts()

## Cleaning up data

Now let's take a look at how to clean up the data and prepare it for use.

### Dealing with missing values

Let's check and see what we should do if we have missing values (NaN) in some of our columns. 

#### Deleting missing values

One option is to drop any rows that have NaN values. This, of course, results in a loss of data.
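
To make the cost of this option concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are invented for illustration and are not taken from *housing.csv*.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one of the four rows is missing total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Drop only the rows where total_bedrooms is NaN
toy_cleaned = toy.dropna(subset=["total_bedrooms"])

print(len(toy), "rows before,", len(toy_cleaned), "rows after")
```

Note that the whole row is removed, so the valid `total_rooms` value in that row is lost as well.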

In [None]:
# checking the current length
len(housing)

In [None]:
# removing the na
cleaned_housing = housing.dropna(subset=["total_bedrooms"]) 

#### Exercise time

Check the size of the dataset after removing the rows with missing values.

In [None]:
# check the new length


#### Replacing missing values

Another option is to replace the missing values with something else. One of the most common replacements is the median of the other values in the same column.

In scikit-learn, you can use [imputers](https://scikit-learn.org/stable/modules/impute.html) to perform this calculation. In this test case, we're using *median* as the replacement strategy.
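
Before applying this to the housing data, a small sketch on invented numbers may help show what the imputer does. The `toy` table and the name `toy_imputer` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table with one missing value in each column
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, np.nan, 10.0],
    "age": [10.0, np.nan, 30.0, 50.0],
})

toy_imputer = SimpleImputer(strategy="median")
# fit learns one median per column; transform fills the NaNs with it
filled = toy_imputer.fit_transform(toy)

print(toy_imputer.statistics_)  # the per-column medians learned during fit
print(filled)
```

Each column gets its own median, computed only from the values that are present.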

In [None]:
# Let's take a look at rows with missing values
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

In [None]:
# Let's set the imputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [None]:
# we need to remove the categorical variables because 
# median doesn't work for them
housing_num = housing.drop("ocean_proximity", axis=1)

In [None]:
# fitting
imputer.fit(housing_num)

In [None]:
# showing the calculated median values
imputer.statistics_

In [None]:
# Performing transformation to replace columns
X = imputer.transform(housing_num)

In [None]:
# generating transformed dataframe
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

In [None]:
# let's look at the rows that had missing values
housing_tr.loc[sample_incomplete_rows.index.values]

In [None]:
# seeing if the rest of data is intact
housing_tr.head()

#### Practice time

Repeat the same process with the "most_frequent" strategy. What replacement values do you get this time?

In [None]:
# for your code

### Scaling

As mentioned in class, it's often more effective to scale all the features so that they have similar ranges and variances.

This usually happens after the other steps (cleaning, etc.) are done.

Today, we will use scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to achieve this.
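
As a rough sketch of what standardization computes, the example below compares `StandardScaler` against the same calculation done by hand on invented numbers. The `toy` array is hypothetical and only serves the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: subtract the column mean,
# then divide by the column standard deviation
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, every column has a mean of 0 and a standard deviation of 1, regardless of its original units.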

In [None]:
# we need to remove the categorical variables here
housing_num = housing.drop("ocean_proximity", axis=1)

In [None]:
# let's take a look at the first rows before scaling
housing_num.head()

In [None]:
# fitting data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(housing_num)

In [None]:
# transforming
X = scaler.transform(housing_num)
housing_sc = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

#### Time to practice

Take a look at the first rows of the new dataframe. Compare the new distributions (histograms) to the old ones.

In [None]:
# your code

### Handling text/categorical data

Now let's take a look at how we can handle text or categorical data. Let's start by looking at one such column.

In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.value_counts()

[OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) in scikit-learn takes in an array of categorical values and assigns a number to each unique category, changing text/categories to numbers.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

In [None]:
# let's look at the categories it found
ordinal_encoder.categories_

As mentioned in class, there are some challenges with using numbers as categories, because the ordering of the numbers can imply a relation between categories that does not exist. To avoid that, we can one-hot encode categories using scikit-learn's [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
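
To see the difference side by side, here is a minimal sketch that encodes the same invented column both ways. The `toy_cat` array is hypothetical; its category names merely resemble the ones in the housing data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column with three categories (2D array, one column)
toy_cat = np.array([["INLAND"], ["NEAR BAY"], ["ISLAND"], ["INLAND"]])

# One integer per category, assigned in sorted order of the category names
ordinal = OrdinalEncoder().fit_transform(toy_cat)

# One column per category, with a single 1 marking the category of each row
one_hot = OneHotEncoder().fit_transform(toy_cat).toarray()

print(ordinal.ravel())
print(one_hot)
```

In the ordinal output one category ends up numerically "between" the other two, an ordering that carries no meaning here. In the one-hot output all categories are equally far apart.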

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
# by default, this returns a sparse array
# why do you think?
housing_cat_1hot

In [None]:
# let's take a look at the full array
housing_cat_1hot.toarray()

In [None]:
# let's see if we found categories correctly
cat_encoder.categories_

### Pipelines *(optional)* 

It is often advised to build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that performs all your transformation and cleaning steps for you. This way, you can easily run the same code consistently every time you need it.
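
As a minimal sketch of the idea, the example below chains an imputer and a scaler on a single invented column. The `toy` array and the name `toy_pipeline` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value
toy = np.array([[1.0], [np.nan], [3.0]])

# Steps run in order: fill NaNs with the median, then standardize
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

prepared = toy_pipeline.fit_transform(toy)
print(prepared.ravel())
```

A single `fit_transform` call runs both steps, so the order of operations cannot be mixed up between runs.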

Let's take a look at how this is done for this case.

In [None]:
from sklearn.pipeline import Pipeline

# preparing numerical data
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

In [None]:
from sklearn.compose import ColumnTransformer

#preparing categorical data
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

# preparing data
housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
# let's look at the shape of the final prepared array
housing_prepared.shape

(20640, 14)

## Creating a test set

In this segment, we will practice generating a train/test split.

In [None]:
# we can use scikit for splitting test and train
# Let's see how it looks
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    housing, test_size=0.2, random_state=42)

#### Time to practice

What are the sizes of the test and training sets?
You can use the function **len**: len(X), where X is a list, returns the number of items in the list, i.e. its size.

In [None]:
# Empty code cell for you

### What's a good split

Let's take a look at one of the variables to see if we had a good split.

In [None]:
housing["median_income"].hist()

In [None]:
# Let's say we want to create income categories that are 
# based on median income. From the histogram above, we chose these cut-offs
# to get an even distribution of data among categories
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
housing["income_cat"].hist()

### Stratified split

Let's take a look at another type of splitting here. We use scikit-learn's [StratifiedShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) class, which preserves the percentage of samples from each class in both the training and test sets.
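
To illustrate what "preserving the percentage" means, here is a small sketch on invented labels. It uses the `stratify` argument of `train_test_split`, a related shortcut, rather than the `StratifiedShuffleSplit` class used in this notebook. The `labels` and `features` arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 samples of class 0 and 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

# stratify=labels asks for the class ratio to be kept in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

print("class 1 share overall:", labels.mean())
print("class 1 share in test:", y_test.mean())
```

With a purely random split, the share of class 1 in the test set could drift away from the overall share, especially for small or rare classes.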

In [None]:
# splitting using scikit learn
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split() yields the row indices of the train and test sets; because housing
# uses the default 0..n-1 index, .loc can select the rows by these indices
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

In [None]:
# comparing the two splits:
# this function takes the number of rows in each income_cat category
# and normalizes it by the total number of rows
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)


# compares the proportion of each category in the random and stratified
# test sets with its proportion in the overall dataset
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

#### Time to practice

Print out the *compare_props* data frame to see which of the two methods better represents the actual data distribution. Note that the values show the number of members in each category normalized by the total number of rows.

In [None]:
# for you to run