# Exploring Your Data - Lab

## Introduction 

In this lab, you'll perform an exploratory data analysis (EDA) task using statistical and visual EDA skills. You'll continue using the Lego dataset that you acquired and cleaned in the previous labs.

## Objectives
You will be able to:

* Check the distribution of various columns
* Examine the descriptive statistics of our data set
* Create visualizations to help us better understand our data set

## Data Exploration

At this point, you've already done a modest amount of data exploration, from investigating the initial database to exploring individual features while cleaning things up in preparation for modeling. During this process, you've become more familiar with the particular idiosyncrasies of the dataset. This gives you an opportunity to uncover difficulties and potential pitfalls in working with the dataset, as well as potential avenues for feature engineering that could improve the predictive performance of your model down the line. Remember that this is not a linear process: after building an initial model, you might go back and continue to mine the dataset for additional features if the initial results do not meet your needs and expectations. Here, you'll continue this process by investigating the distributions of some of the features and their relationship to the target variable: `list_price`.

### Load the dataset 'Lego_dataset_cleaned.csv' and Check its Contents

In [1]:
#Your code here
import pandas as pd

# Load the cleaned dataset and preview the first few rows
df = pd.read_csv('Lego_dataset_cleaned.csv')
df.head()

### Describe the dataset using five-point statistics and record your observations

In [2]:
#Your code here
df.describe()

| | piece_count | list_price | num_reviews | play_star_rating | star_rating | val_star_rating | ages_10+ | ages_10-14 | ages_10-16 | ages_10-21 | ... | country_NZ | country_PL | country_PT | country_US | review_difficulty_Average | review_difficulty_Challenging | review_difficulty_Easy | review_difficulty_Very Challenging | review_difficulty_Very Easy | review_difficulty_unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | ... | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 | 10870.0 |
| mean | 2.287856e-17 | 67.309137 | 1.340030e-17 | 3.505388e-14 | 2.523956e-13 | -1.584433e-13 | 0.049126 | 0.001932 | 0.013615 | 0.016927 | ... | 0.046274 | 0.04333 | 0.044618 | 0.066421 | 0.308832 | 0.091536 | 0.351978 | 0.001932 | 0.083257 | 0.162466 |
| std | 1.0 | 94.669414 | 1.0 | 1.0 | 1.0 | 1.0 | 0.216141 | 0.043913 | 0.115894 | 0.129005 | ... | 0.210088 | 0.203609 | 0.206474 | 0.249029 | 0.462033 | 0.288384 | 0.477609 | 0.043913 | 0.276282 | 0.368894 |
| min | -0.6050659 | 2.2724 | -0.4264402 | -5.883334 | -5.641909 | -5.193413 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.4895715 | 21.899 | -0.3705846 | -0.48101 | -0.4602216 | -0.3650101 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | -0.3379852 | 36.5878 | -0.2868011 | 0.2160641 | 0.1615809 | 0.1178302 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.06263593 | 73.1878 | -0.1192341 | 0.5646012 | 0.7833834 | 0.6006705 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 8.466055 | 1104.87 | 9.795146 | 1.087407 | 0.990651 | 1.244458 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Use pandas histogram plotting to plot histograms for all the variables in the dataset

In [None]:
#Your code here
import matplotlib.pyplot as plt
%matplotlib inline
df.hist(figsize=(20,15))

Note how skewed most of these distributions are. While linear regression does not assume that each individual predictor is normally distributed, it does assume a linear relationship between the predictors and the target variable (`list_price` in this case). To investigate whether this assumption holds, you can use seaborn to plot single-variable regression plots of each feature against the target variable.
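If you want a number to go with the visual impression of skew, one option is the sample skewness that pandas computes with `DataFrame.skew()`. The sketch below is only an illustration: it uses a small synthetic DataFrame (`demo`, with made-up values) rather than the lab's data, and borrows two column names from the dataset purely as labels.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lab's DataFrame; the values are made up.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "piece_count": rng.lognormal(mean=5.0, sigma=1.0, size=1000),  # long right tail
    "star_rating": rng.normal(loc=0.0, scale=1.0, size=1000),      # roughly symmetric
})

# Sample skewness per column: values near 0 suggest a symmetric distribution,
# large positive values suggest a long right tail.
skewness = demo.skew().sort_values(ascending=False)
print(skewness)
```

On the real dataset, the same call on `df` would rank the columns by skewness, which can help decide which histograms deserve a closer look.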

## Check for Linearity

Recall that one assumption in linear regression is that the target variable is linearly related to the input features. As shown in the previous lesson, you can use the `sns.jointplot()` function to investigate whether this relation holds true for the various predictors on hand.

In [None]:
#Your code here

In [None]:
#Your code here

In [None]:
#Your code here

In [None]:
#Your code here

In [None]:
#Your code here

## Comments

At first glance, the earlier effort to fill in the null review values appears to have been of little value. Perhaps this is due to imputing the mean, but as it currently stands, each of the rating features seems to have little to no predictive power for the upcoming model.

## Checking for Multicollinearity

It's also important to note whether your predictive features will result in multicollinearity in the resulting model. While definitive checks for multicollinearity require analyzing the resulting model, predictors with overly high pairwise correlation (r^2 > .65) are almost certain to produce multicollinearity in a model. With that, take a minute to generate the pairwise Pearson correlation coefficients of your predictive features and visualize these coefficients as a heatmap.
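One possible way to approach this is sketched below, using `DataFrame.corr()` for the Pearson coefficients and `sns.heatmap()` for the plot. The `features` DataFrame is synthetic (one pair of columns is deliberately built to be correlated), and the list of flagged pairs is an extra convenience that applies the r^2 > .65 cutoff mentioned above. Treat it as an illustration under those assumptions rather than the expected solution.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
from itertools import combinations

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in predictors; the values are made up.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
features = pd.DataFrame({
    "piece_count": base,
    "num_reviews": base + rng.normal(scale=0.3, size=500),  # built to track piece_count
    "star_rating": rng.normal(size=500),                     # independent of the others
})

# Pairwise Pearson correlation coefficients, drawn as an annotated heatmap.
corr = features.corr(method="pearson")
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, center=0)

# List each pair of predictors whose squared correlation exceeds the cutoff.
flagged = [
    (a, b, round(corr.loc[a, b] ** 2, 3))
    for a, b in combinations(corr.columns, 2)
    if corr.loc[a, b] ** 2 > 0.65
]
print(flagged)
```

With the real data, you would build `features` from `df` by dropping the target column `list_price` before computing the correlations.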

In [None]:
#Your code here

In [None]:
#Your code here

> Comments: The rating features show little promise for adding predictive power for `list_price`. This diminishes worry concerning their high correlation. That said, the two most promising predictors, `piece_count` and `num_reviews`, also display fairly high correlation. Further analysis of an initial model will clearly be warranted.

## Further Resources

Have a look at the following resources on how to deal with complex datasets that don't meet our initial expectations.

[What to Do When Bad Data Thwarts Machine Learning Success](https://towardsdatascience.com/what-to-do-when-bad-data-thwarts-machine-learning-success-fb82249aae8b)

[Practical advice for analysis of large, complex data sets](http://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html)

[Data Cleaning Challenge: Scale and Normalize Data](https://www.kaggle.com/rtatman/data-cleaning-challenge-scale-and-normalize-data)

## Summary 

In this lab, you performed some initial EDA to check regression assumptions. In the upcoming lessons, you'll continue to carry out a standard data science process and begin to fit and refine an initial model.