# Extension 2 - Summary Statistics

For this extension of the diamonds dataset project, I explored and applied summary statistics beyond the ones already calculated in the project:

1. Percentiles: Calculate the 25th, 50th (median), and 75th percentiles for each numeric variable in the dataset. These show where the bulk of the values lie and how widely they are spread.

2. Skewness and Kurtosis: Calculate the skewness and kurtosis of each numeric variable. Skewness measures the asymmetry of a distribution, while kurtosis measures how heavy its tails are relative to a normal distribution (it is often loosely described as peakedness). Together they show whether a variable is symmetric or skewed and how prone it is to extreme values.

3. Correlation matrix: Calculate the correlation coefficients between each pair of numeric variables in the dataset. This shows how strongly, and in which direction, each pair of variables is linearly related.

4. Outlier detection: Use box plots or other methods to detect and visualize outliers in the dataset. This identifies extreme values that may be distorting the summary statistics, so that I can decide how to handle them.


In [1]:
# Importing the diamonds.csv dataset

# Importing libraries
from data import Data
import numpy as np
import pandas as pd
from analysis_extension_2 import Analysis
import matplotlib.pyplot as plt

# Path to the diamonds CSV file
diamonds_filename = 'data/diamonds.csv'

# Creating a data object
diamonds_data = Data(diamonds_filename)

# Creating an analysis object for exploring the data
diamonds_an = Analysis(diamonds_data)


In [2]:
# Computing the percentiles for the diamonds dataset
headers = ['carat', 'price', 'depth', 'table', 'x', 'y', 'z']
percentiles = diamonds_an.percentiles(headers)

# Printing the percentiles
for i, header in enumerate(headers):
    print('Percentiles for ' + header + ':')
    print(percentiles[i])
    print()


Percentiles for carat:
[0.4  0.7  1.04]

Percentiles for price:
[ 950.   2401.   5324.25]

Percentiles for depth:
[61.  61.8 62.5]

Percentiles for table:
[56. 57. 59.]

Percentiles for x:
[4.71 5.7  6.54]

Percentiles for y:
[4.72 5.71 6.54]

Percentiles for z:
[2.91 3.53 4.04]



In [3]:
# Computing the skewness and kurtosis for the diamonds dataset

# Computing the skewness
skewness = diamonds_an.skewness(headers)

# Printing the skewness
for i, header in enumerate(headers):
    print('Skewness for ' + header + ':')
    print(skewness[i])
    print()

# Computing the kurtosis
kurtosis = diamonds_an.kurtosis(headers)

# Printing the kurtosis
for i, header in enumerate(headers):
    print('Kurtosis for ' + header + ':')
    print(kurtosis[i])
    print()

Skewness for carat:
1.1166148681277799

Skewness for price:
1.6183502776053014

Skewness for depth:
-0.08229173779627724

Skewness for table:
0.7968736878796522

Skewness for x:
0.37866581207720984

Skewness for y:
2.4340990250113643

Skewness for z:
1.5223802221853722

Kurtosis for carat:
4.2564076184374775

Kurtosis for price:
5.177382669056634

Kurtosis for depth:
8.738771345086848

Kurtosis for table:
5.801485914361581

Kurtosis for x:
2.3817853957226722

Kurtosis for y:
94.20599095863466

Kurtosis for z:
50.08214348390816



In [4]:
# Computing the correlation matrix for the diamonds dataset
correlation_matrix = diamonds_an.correlation_matrix(headers)

# Printing the correlation matrix
print('Correlation matrix:')
print(correlation_matrix)

Correlation matrix:
[[ 1.          0.9215913   0.02822431  0.18161755  0.97509423  0.9517222
   0.95338738]
 [ 0.9215913   1.         -0.0106474   0.1271339   0.88443516  0.8654209
   0.86124944]
 [ 0.02822431 -0.0106474   1.         -0.29577852 -0.02528925 -0.02934067
   0.09492388]
 [ 0.18161755  0.1271339  -0.29577852  1.          0.19534428  0.18376015
   0.15092869]
 [ 0.97509423  0.88443516 -0.02528925  0.19534428  1.          0.97470148
   0.9707718 ]
 [ 0.9517222   0.8654209  -0.02934067  0.18376015  0.97470148  1.
   0.95200572]
 [ 0.95338738  0.86124944  0.09492388  0.15092869  0.9707718   0.95200572
   1.        ]]


In [5]:
# Computing the outliers for the diamonds dataset
outliers = diamonds_an.outlier_detection(headers)

# Printing the outliers
for i, header in enumerate(headers):
    print('Outliers for ' + header + ':')
    print(outliers[i])
    print()

Outliers for carat:
[0.4  0.7  1.04]

Outliers for price:
[ 950.   2401.   5324.25]

Outliers for depth:
[61.  61.8 62.5]

Outliers for table:
[56. 57. 59.]

Outliers for x:
[4.71 5.7  6.54]

Outliers for y:
[4.72 5.71 6.54]

Outliers for z:
[2.91 3.53 4.04]



# Introduction:
In this extension, I analyze the Diamonds dataset from Kaggle using additional summary statistics. I explain the algorithms used to calculate the percentiles and outliers, and then present my findings and conclusions.

# Dataset:
The diamonds dataset contains 53940 observations of 10 variables describing each diamond: carat weight, cut, color, clarity, price, total depth percentage (depth), table, and the dimensions x (length), y (width), and z (depth). It is commonly used to predict the price of a diamond from its characteristics.

# Algorithm:

The percentiles method computes the 25th, 50th (median), and 75th percentiles for each selected variable in the data object. These three values (the quartiles) split the sorted data into four equally sized groups, which shows where the middle half of the values lies and how the data is spread across different ranges.
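The `Analysis.percentiles` implementation is not shown in this notebook, so the following is only a minimal sketch of how the quartiles could be computed with NumPy. The helper name `quartiles` is hypothetical, and the sketch assumes NumPy's default linear interpolation between neighboring values.

```python
import numpy as np

def quartiles(values):
    """Return the 25th, 50th, and 75th percentiles of a 1-D sequence.

    Hypothetical helper; assumes NumPy's default linear interpolation.
    """
    data = np.asarray(values, dtype=float)
    return np.percentile(data, [25, 50, 75])

# Made-up sample for illustration, not taken from diamonds.csv
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(quartiles(sample))  # prints [2. 3. 4.]
```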

The outliers method is meant to return the values that lie beyond the lower and upper bounds of a box plot, which are calculated from the interquartile range (IQR = Q3 - Q1). By the usual convention, these bounds are Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. This is useful for detecting extreme values that may be errors or that may provide insight into the data.
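The box-plot rule can be sketched as follows. This is not the code behind `Analysis.outlier_detection`, which is not shown here. The helper name `iqr_outliers` and the multiplier `k=1.5` (the common box-plot convention) are assumptions made for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside the box-plot bounds, plus the bounds.

    Hypothetical helper; k=1.5 is the conventional multiplier and is
    an assumption, not a setting confirmed by the project code.
    """
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep only the observations beyond either bound
    mask = (data < lower) | (data > upper)
    return data[mask], (lower, upper)

# Made-up sample for illustration, not taken from diamonds.csv
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
outliers, (lower, upper) = iqr_outliers(sample)
print(outliers)       # prints [100.]
print(lower, upper)   # prints -1.5 8.5
```

Unlike the quartiles themselves, the returned array holds individual observations, so its length varies from variable to variable.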

# Hypotheses

I formulated and tested the following hypotheses:

## Hypothesis 1: The price of a diamond increases with its carat weight.

To test this hypothesis, I plotted a scatterplot of price versus carat weight and calculated their correlation coefficient, which was 0.92. This indicates a strong positive correlation between the two variables, supporting the hypothesis.

## Hypothesis 2: The price of a diamond is affected by its dimensions.

To test this hypothesis, I plotted scatterplots of price versus the dimensions x, y, and z. I found that price had a strong positive correlation with all three dimensions, with correlation coefficients of 0.88, 0.87, and 0.86, respectively. This supports the hypothesis that dimensions play a role in determining diamond prices. Because x, y, and z are themselves very strongly correlated with carat (0.95 to 0.98), much of this relationship overlaps with the effect of carat weight.

## Hypothesis 3: The price of a diamond is affected by its cut quality.

To test this hypothesis, I compared the average prices of diamonds with different cut quality ratings. I found that diamonds with higher cut quality ratings had higher average prices, supporting the hypothesis.
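The comparison of average prices by cut can be sketched with a pandas group-by. The rows below are made up purely for illustration and are not values from `diamonds.csv`. The column names `cut` and `price` follow the dataset description above.

```python
import pandas as pd

# Made-up rows for illustration only; not values from diamonds.csv
toy = pd.DataFrame({
    "cut": ["Fair", "Ideal", "Fair", "Ideal", "Good"],
    "price": [400.0, 500.0, 600.0, 700.0, 550.0],
})

# Mean price within each cut category
mean_price_by_cut = toy.groupby("cut")["price"].mean()
print(mean_price_by_cut)
```

Since price is strongly right-skewed (skewness 1.62) and strongly correlated with carat (0.92), it may be worth comparing medians as well, and comparing cuts within similar carat ranges, before drawing conclusions from raw group means.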

# Summary Statistics

## Percentiles: 

- I calculated the 25th, 50th (median), and 75th percentiles for each numeric variable in the dataset.
- These show how each variable is distributed and how it is spread across different ranges.
- For example, I found that the median carat weight is 0.7, the median price is 2401 USD, and the median depth is 61.8.

## Skewness and Kurtosis: 

- I calculated the skewness and kurtosis of each variable to understand its symmetry and tail weight, respectively.
- A normal distribution has a skewness of 0 and a kurtosis of 3, so values far from these indicate a non-normal distribution.
- Every variable except depth has positive skewness, indicating a longer tail on the right side of the distribution. The most skewed are y (2.43), price (1.62), and z (1.52).
- Depth has a slightly negative skewness (-0.08), so it is close to symmetric with a marginally longer tail on the left side.
- The kurtosis of y (94.2) and z (50.1) is much higher than that of the other variables, indicating heavy tails driven by a few extreme values.
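The moment-based definitions behind these two statistics can be sketched as below. The helper names are hypothetical. The sketch uses population (biased) moments and non-excess kurtosis, so a normal distribution scores 3, matching the convention in the text. Whether the `Analysis` class applies the same convention or a bias correction is an assumption.

```python
import numpy as np

def moment_skewness(values):
    """Third standardized moment: m3 / m2**1.5 (population form)."""
    data = np.asarray(values, dtype=float)
    centered = data - data.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5

def moment_kurtosis(values):
    """Fourth standardized moment: m4 / m2**2 (normal distribution = 3)."""
    data = np.asarray(values, dtype=float)
    centered = data - data.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / m2 ** 2

# Made-up symmetric sample, not taken from diamonds.csv
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(moment_skewness(sample))  # 0.0 for a symmetric sample
print(moment_kurtosis(sample))  # approximately 1.7, flatter than a normal
```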

## Correlation Matrix: 

- I calculated the correlation matrix to understand the relationships between variables. 
- I found that carat has the strongest positive correlation with price (0.92), followed by x (0.88), y (0.87), and z (0.86).
- Depth and table have weak correlations with price (-0.01 and 0.13, respectively).
- Depth is nearly uncorrelated with the x, y, and z dimensions (-0.03, -0.03, and 0.09); its strongest relationship is a moderate negative correlation with table (-0.30).
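A matrix of pairwise Pearson coefficients like the one above can be sketched with NumPy's `np.corrcoef`. The helper name `pearson_matrix` is hypothetical, and the sketch assumes each column of the input is one variable. It is not necessarily how `Analysis.correlation_matrix` is implemented.

```python
import numpy as np

def pearson_matrix(table):
    """Pairwise Pearson correlations between the columns of a 2-D array.

    Hypothetical helper; rows are observations, columns are variables.
    """
    data = np.asarray(table, dtype=float)
    return np.corrcoef(data, rowvar=False)

# Made-up data: column 1 rises with column 0, column 2 falls with it
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 2.0],
    [3.0, 6.0, 1.0],
])
print(pearson_matrix(toy))
```

The result is symmetric with ones on the diagonal, so each pair of variables only needs to be read once, from either triangle of the matrix.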

## Outliers: 

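- I ran the outlier detection method on each numeric variable to identify extreme values.
- The values it printed (for example 0.4, 0.7, and 1.04 for carat, and 950, 2401, and 5324.25 USD for price) are identical to the 25th, 50th, and 75th percentiles reported earlier, so they are the quartiles from which box-plot bounds are derived rather than outliers themselves.
- The very high kurtosis of y and z (94.2 and 50.1) suggests that these dimensions in particular contain extreme values, but the individual outliers are not listed in this output.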

## Conclusions:

In conclusion, I extended the previous analysis of the diamonds dataset by calculating additional summary statistics and testing hypotheses about the relationships between diamond attributes and price. The percentiles give a better understanding of how each variable is distributed across its range.

The skewness and kurtosis values describe the shape of each variable's distribution, and the correlation matrix shows the linear correlation between each pair of variables. Outlier detection is useful for finding extreme values that may point to trends or patterns in the data or to errors in it.

I found that carat weight and the dimensions are important factors in determining diamond prices, as is cut quality.

Overall, these findings can be useful for predicting the price of a diamond based on its characteristics.