# Model Development Challenge: Data Analysis

## Data Representation (Visualization)

### Representation

Regardless of the dataset you are working with, the features will likely be a combination of numerical and categorical values.  Python can read, handle, and manipulate both types of data.  The data can even be a single image or a set of images.  Keep in mind that just because your dataset contains different types of values, or different features with the same type of values, you do not have to use all of the data in your analysis.

This is where ***representation*** comes into play.  You can select a subset of your data to represent your dataset as a whole.  There are different ways to do this, but one of the most common and straightforward is to use only your numeric values.  When you want to use just a few of the many features in your dataset, data visualization can be very useful.  Through simple visual inspection of your data you can often see which features are correlated with each other, which appear predictive of an output, and so on.
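
As a minimal sketch of the "numeric values only" approach, the snippet below keeps just the numeric columns of a small table. The table itself is hypothetical (the `car_a`-style names and the three rows are made up for illustration); only `pandas.DataFrame.select_dtypes` is the point.

```python
import pandas

# Hypothetical mixed-type table; names and values are for illustration only.
cars = pandas.DataFrame({
    "Model": ["car_a", "car_b", "car_c"],   # categorical (string) feature
    "MPG": [18.0, 27.0, 44.0],
    "Cylinders": [8.0, 4.0, 4.0],
    "Weight": [3504.0, 2790.0, 2130.0],
})

# Keep only the numeric columns as the representation of the dataset.
numeric_only = cars.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Dropping columns by name, as the next cell does with `"Model"`, achieves the same result when you already know which columns are non-numeric.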

In [46]:
import numpy
import pandas
import toyplot
import scipy.stats

In [31]:
# Load the cars dataset bundled with toyplot and drop the "Model" column
data = toyplot.data.cars()
col_names = list(data.keys())
data = pandas.DataFrame(data, columns=col_names)
data = data.drop("Model", axis=1)
data

| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Year | Origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8.0 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 | 1.0 |
| 1 | 15.0 | 8.0 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 | 1.0 |
| 2 | 18.0 | 8.0 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 | 1.0 |
| 3 | 16.0 | 8.0 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 | 1.0 |
| 4 | 17.0 | 8.0 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 387 | 27.0 | 4.0 | 140.0 | 86.0 | 2790.0 | 15.6 | 82.0 | 1.0 |
| 388 | 44.0 | 4.0 | 97.0 | 52.0 | 2130.0 | 24.6 | 82.0 | 2.0 |
| 389 | 32.0 | 4.0 | 135.0 | 84.0 | 2295.0 | 11.6 | 82.0 | 1.0 |
| 390 | 28.0 | 4.0 | 120.0 | 79.0 | 2625.0 | 18.6 | 82.0 | 1.0 |


In [33]:
for x in data.columns[1:]:
    canvas = toyplot.Canvas(500,500)
    axes = canvas.cartesian(label=str(x)+" vs MPG", xlabel=str(x), ylabel="MPG")
    mark = axes.scatterplot(data[x], data['MPG'])
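
A visual impression from scatter plots like these can be checked numerically. The sketch below computes the Pearson correlation of each feature with MPG. To stay self-contained it uses only the nine rows visible in the table output above, not the full dataset, so the exact numbers are illustrative; the `rows` variable name is an assumption.

```python
import pandas

# Only the nine rows shown in the table output above (not the full dataset).
rows = pandas.DataFrame({
    "MPG": [18.0, 15.0, 18.0, 16.0, 17.0, 27.0, 44.0, 32.0, 28.0],
    "Cylinders": [8, 8, 8, 8, 8, 4, 4, 4, 4],
    "Displacement": [307, 350, 318, 304, 302, 140, 97, 135, 120],
    "Horsepower": [130, 165, 150, 150, 140, 86, 52, 84, 79],
    "Weight": [3504, 3693, 3436, 3433, 3449, 2790, 2130, 2295, 2625],
    "Acceleration": [12.0, 11.5, 11.0, 12.0, 10.5, 15.6, 24.6, 11.6, 18.6],
})

# Pearson correlation of every feature with the target column.
corr_with_mpg = rows.corr()["MPG"].drop("MPG")

# Order features by the strength (absolute value) of their correlation.
order = corr_with_mpg.abs().sort_values(ascending=False).index
ranked = corr_with_mpg.reindex(order)
print(ranked)
```

On the full dataset the same idea would be `data.corr()["MPG"]`. Features with a correlation near zero are candidates to leave out of the representation.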

## Descriptive Statistics

Descriptive statistics summarize your data and can help reveal patterns that are not obvious from the raw values.  They describe the data without making any predictions or inferences.  Descriptive statistics typically involve computing the mean, median, mode, variance, and standard deviation.  These basic statistics give you insight into the distribution and spread of your data, as well as the potential presence of outliers or anomalies.
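
To illustrate how these statistics can hint at outliers, here is a small sketch on hypothetical values (not taken from the cars dataset). It compares the mean with the median and applies the common 1.5 × IQR rule of thumb; that rule is one convention among several, chosen here as an assumption.

```python
import pandas

# Hypothetical measurements with one unusually large value.
values = pandas.Series([10.0, 11.0, 12.0, 12.0, 13.0, 14.0, 60.0])

mean = values.mean()
median = values.median()

# A mean well above the median suggests a right-skewed distribution
# or a few large values pulling the mean upward.
print("mean:", mean, "median:", median)

# Interquartile range (IQR) rule of thumb for flagging outliers.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print("flagged as outliers:", outliers.tolist())
```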

In [43]:
mean = data.mean()
print("The mean for each column is:")
print(mean)
print()
median = data.median()
print("The median for each column is:")
print(median)
print()
mode = data.mode().values
print("The mode for each column is:")
print(mode)
print()
var = data.var()
print("The variance for each column is:")
print(var)
print()
std = data.std()
print("The standard deviation for each column is:")
print(std)
print()

The mean for each column is:
MPG               23.445918
Cylinders          5.471939
Displacement     194.411990
Horsepower       104.469388
Weight          2977.584184
Acceleration      15.541327
Year              75.979592
Origin             1.576531
dtype: float64

The median for each column is:
MPG               22.75
Cylinders          4.00
Displacement     151.00
Horsepower        93.50
Weight          2803.50
Acceleration      15.50
Year              76.00
Origin             1.00
dtype: float64

The mode for each column is:
[[1.300e+01 4.000e+00 9.700e+01 1.500e+02 1.985e+03 1.450e+01 7.300e+01
  1.000e+00]
 [      nan       nan       nan       nan 2.130e+03       nan       nan
        nan]]

The variance for each column is:
MPG                 60.918142
Cylinders            2.909696
Displacement     10950.367554
Horsepower        1481.569393
Weight          721484.709008
Acceleration         7.611331
Year                13.569915
Origin               0.648860
dtype: float64

Th

## Inferential Statistics

Inferential statistics use a sample of data to draw conclusions, or test hypotheses, about the larger population the sample came from.  This includes predicting a dependent variable from one or more independent variables.  Inferential statistics often involve tests and scores based on probability distributions.  Generally speaking, they examine similarities and differences between samples of data.  Common tests and scores include:

- z-Score
- t-Test
- z-Test
- f-Test
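
As a rough sketch of two items from this list, the snippet below computes z-scores and runs a two-sample t-test with `scipy.stats`. The MPG values are the ones visible in the table output earlier, split by cylinder count; the grouping and the variable names are assumptions made for illustration, and nine rows are far too few for a real analysis.

```python
import numpy
import scipy.stats

# MPG values from the rows shown earlier, grouped by number of cylinders.
mpg_8cyl = numpy.array([18.0, 15.0, 18.0, 16.0, 17.0])
mpg_4cyl = numpy.array([27.0, 44.0, 32.0, 28.0])

# z-score: how many standard deviations each value lies from the mean.
all_mpg = numpy.concatenate([mpg_8cyl, mpg_4cyl])
z_scores = scipy.stats.zscore(all_mpg)
print("z-scores:", z_scores)

# Welch's t-test (equal_var=False): do the two groups have different means?
t_stat, p_val = scipy.stats.ttest_ind(a=mpg_8cyl, b=mpg_4cyl, equal_var=False)
print("t statistic:", t_stat, "p-value:", p_val)
```

A negative t statistic here means the first group's mean is lower than the second's; the p-value indicates how surprising that difference would be if both groups shared the same mean.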

In [49]:
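# Draw a random subsample of 40 rows from the full dataset
sample = data.sample(n=40)

# Perform Welch's t-test (equal_var=False) on the full data and the subsample;
# the test is applied column by column, giving one p-value per feature

t_stat, p_val = scipy.stats.ttest_ind(a=data, b=sample, equal_var=False)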

print("The p-values are: ", p_val)

The p-values are:  [0.6750935  0.47281334 0.30186942 0.66290149 0.51242424 0.78163762
 0.53783305 0.99065369]
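
One way to read these p-values is to compare each against a significance level. The sketch below hard-codes the p-values printed above and pairs them with the column names in order; the threshold `alpha = 0.05` is a conventional choice assumed here, not something fixed by the analysis. Because `data.sample` is random, rerunning the cell will give different numbers.

```python
import pandas

columns = ["MPG", "Cylinders", "Displacement", "Horsepower",
           "Weight", "Acceleration", "Year", "Origin"]

# p-values copied from the output above (they change with each random sample).
p_values = [0.6750935, 0.47281334, 0.30186942, 0.66290149,
            0.51242424, 0.78163762, 0.53783305, 0.99065369]

alpha = 0.05  # assumed significance level

results = pandas.Series(p_values, index=columns)
significant = results[results < alpha]

if significant.empty:
    print("No column mean differs significantly between the sample and the full data.")
else:
    print("Columns with a significant difference:", list(significant.index))
```

With every p-value above the threshold, there is no evidence that the subsample's means differ from those of the full dataset, which is what you would expect from a random subsample.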


## Drawing Conclusions

Next to the analysis itself, drawing conclusions from the analysis is the most important part of the data analysis process.  More often than not, your analysis will not directly answer the questions you have about the data, but it will provide results and insights that give you reasons to either support or reject your hypotheses.

***Based on the analysis that you've gone through in this lesson, what are some key conclusions that you can draw? What do the various phases tell you about the features that should be used to represent this dataset? What trends and/or correlations do you see between the inputs and target variables?***