# Programming for Data Analysis

## Project 2 - Diamonds dataset simulation

For this project you must create a data set by simulating a real-world phenomenon of
your choosing. You may pick any phenomenon you wish – you might pick one that is
of interest to you in your personal or professional life. Then, rather than collect data
related to the phenomenon, you should model and synthesise such data using Python.
We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:
• Choose a real-world phenomenon that can be measured and for which you could
collect at least one-hundred data points across at least four different variables.
• Investigate the types of variables involved, their likely distributions, and their
relationships with each other.
• Synthesise/simulate a data set as closely matching their properties as possible.
• Detail your research and implement the simulation in a Jupyter notebook – the
data set itself can simply be displayed in an output cell within the notebook.
Note that this project is about simulation – you must synthesise a data set. Some
students may already have some real-world data sets in their own files. It is okay to
base your synthesised data set on these should you wish (please reference it if you do),
but the main task in this project is to create a synthesised data set. The next section
gives an example project idea.

### Example project idea
As a lecturer I might pick the real-world phenomenon of the performance of students
studying a ten-credit module. After some research, I decide that the most interesting
variable related to this is the mark a student receives in the module - this is going to be
one of my variables (grade).

Upon investigation of the problem, I find that the number of hours on average a
student studies per week (hours), the number of times they log onto Moodle in the
first three weeks of term (logins), and their previous level of degree qualification (qual)
are closely related to grade. The hours and grade variables will be non-negative real
numbers with two decimal places, logins will be a non-zero integer and qual will be a
categorical variable with four possible values: none, bachelors, masters, or phd.

After some online research, I find that full-time post-graduate students study on average four hours per week with a standard deviation of a quarter of an hour and that
a normal distribution is an acceptable model of such a variable. Likewise, I investigate
the other three variables, and I also look at the relationships between the variables. I
devise an algorithm (or method) to generate such a data set, simulating values of the
four variables for two-hundred students. I detail all this work in my notebook, and then
I add some code in to generate a data set with those properties.


### Problem statement

The purpose of this project is to create a dataset by simulating a real-world phenomenon. Then, rather than collect data related to this phenomenon, such data will be synthesised using Python.
As a real-world phenomenon I've chosen an existing dataset from Kaggle: [Diamonds dataset](https://www.kaggle.com/shivam2503/diamonds) <br> This dataset contains the prices and other attributes of almost 54,000 diamonds.

### **Content of the initial dataset**:

- **Price** - diamond price in US dollars in range (326-18823)
- **Carat** - weight of diamond (0.2 - 5.01)
- **Cut** - quality of the diamond's cut (Fair, Good, Very good, Premium, Ideal)
- **Clarity** - measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- **Length** - length in mm (0 - 10.74)
- **Width** - width in mm (0 - 58.9)
- **Depth** - depth in mm (0 - 31.8)
- **Table** - width of top of diamond relative to widest point (43 - 95)

It was decided for this project's purpose to focus on the Ideal diamond cut only. The following variables will be simulated: 
- price
- carat
- length
- width
- depth

The size of the simulated dataset will be fixed to 20000 samples (data points).

Let's take a look at the summary of the initial dataset, where diamonds cut is *Ideal*:

In [None]:
# Importing the libraries
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

In [None]:
# Importing the Ideal-cut diamonds dataset from the local file
df = pd.read_csv("diamonds_ideal.csv")

# view first 5 rows
df.head()

First, we can use .describe() to view some descriptive facts of the dataset:

In [None]:
df.describe()

Here's a quick breakdown of the above as it relates to this particular dataset:

- _**count**:_ there are 21543 rows in the dataset.
- _**mean**:_ the average value of the variable.
- _**std**:_ the standard deviation. Standard Deviation tells how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out. 
- _**min**:_ the smallest value in the list of the particular variable
- _**25%**:_ the 25th percentile. Shows that 25% of the data are below this value.
- _**50%**:_ the median, this is the measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values.
- _**75%**:_ the 75th percentile. Shows that 75% of the data are below this value.
- _**max**:_ the highest value.
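To make these definitions concrete, here is a minimal sketch on a small made-up array (the values are illustrative and are not taken from the diamonds dataset). One detail matters when comparing numbers later in the notebook: pandas' `.describe()` reports the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`). For thousands of rows the difference is negligible.

```python
import numpy as np

# Small made-up sample (illustrative only, not from the diamonds data)
values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])

summary = {
    "count": values.size,
    "mean": values.mean(),
    "std": values.std(ddof=1),         # sample std, the convention used by .describe()
    "min": values.min(),
    "25%": np.percentile(values, 25),
    "50%": np.median(values),          # even count: average of the two middle values
    "75%": np.percentile(values, 75),
    "max": values.max(),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```

With an even number of observations the median is the average of the two middle values (here 4 and 5).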

From the descriptive statistics we can see that the initial dataset has 21543 records. The price of the diamonds varies between 326 and 18806 USD. The weight of these diamonds is in range 0.2 - 3.5 carats. The length is similar to its width and varies between 3.7 and 9.6 mm. The range for diamonds depth is between 2.3 and 6.0 mm.

After analysis of the descriptive statistics we can come up with some rules for our simulated dataset:
- price: non-zero positive integer. To be more precise with the simulation, the minimum price of a diamond shouldn't be below 300 USD
- carat: non-zero positive number with two decimal places. Again, we will set the minimum weight of a diamond as per the initial dataset: 0.2
- length: non-zero positive number with two decimal places. The lower boundary is 3.76 mm
- width: non-zero positive number with two decimal places. The lower boundary is
- depth: non-zero positive number with two decimal places. The lower boundary is

For easier calculation our simulated dataset will contain 20000 records.

Also we can visually represent the distribution of the variables using a `histogram`.

The histograms below show the frequency distributions of Price and Carat in the initial dataset.

In [None]:
fig1 = plt.figure(figsize=(20,6))
fig1.subplots_adjust(top=0.85, wspace=0.3)

# 1 row, 2 columns, and we'd like the first element.
ax1 = plt.subplot(1, 2, 1)
sns.distplot(df['price'], ax=ax1, color="rebeccapurple", bins = 10)

# 1 row, 2 columns, and we'd like the second element.
ax2 = plt.subplot(1, 2, 2)
sns.distplot(df['carat'], ax=ax2, color="rebeccapurple", bins = 10)

Now let's plot the measurements of the diamonds (Length, Width and Depth):

In [None]:
fig2 = plt.figure(figsize=(20,6))
fig2.subplots_adjust(top=0.85, wspace=0.3)

# 1 row, 3 columns, and we'd like the first element.
ax3 = plt.subplot(1, 3, 1)
sns.distplot(df['length'], ax=ax3, color="rebeccapurple", bins = 10)

# 1 row, 3 columns, and we'd like the second element.
ax4 = plt.subplot(1, 3, 2)
sns.distplot(df['width'], ax=ax4, color="rebeccapurple", bins = 10)

# 1 row, 3 columns, and we'd like the third element.
ax5 = plt.subplot(1, 3, 3)
sns.distplot(df['depth'], ax=ax5, color="rebeccapurple", bins = 10)

## Price

In this section we take a closer look at the first variable, price, and at how we are going to synthesise it.

Looking at the Price histogram above, we can see that it has the shape of a gamma distribution. The `gamma distribution` is a right-skewed distribution for continuous, positive variables. It is a flexible choice because of its two parameters: the shape parameter determines how the distribution looks, and the scale parameter stretches it along the x-axis and so determines where the bulk of the observations lies. E.g.:

In [None]:
gamma = np.random.gamma(1, 2000, 20000)
sns.distplot(gamma, color="rebeccapurple")
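As a rough guide when choosing the two parameters, a gamma distribution with shape `k` and scale `θ` has mean `k * θ` and standard deviation `sqrt(k) * θ`. The sketch below checks this numerically. The parameter values are illustrative only (they are not fitted to the diamonds data), and the seeded `np.random.default_rng` generator is used here just to make the check repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator so the check is repeatable

shape, scale = 2.0, 1500.0        # illustrative values, not fitted to the diamonds data
sample = rng.gamma(shape, scale, 200_000)

theoretical_mean = shape * scale           # k * theta
theoretical_std = np.sqrt(shape) * scale   # sqrt(k) * theta

print("mean:", sample.mean(), "vs", theoretical_mean)
print("std: ", sample.std(), "vs", theoretical_std)
```

Working backwards from a target mean and standard deviation is one way to pick starting values for the shape and scale before fine-tuning them by eye against the histogram.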

The graph above is just a random sample from a gamma distribution; here is our histogram of the price:

In [None]:
sns.distplot(df['price'], color="rebeccapurple", bins = 70)

The first plot is an example of an ideal gamma distribution. Even though we can clearly see some similarities with the second plot, they are not the same. To make the simulation more realistic and closer to our initial price distribution, we will add some random noise to our gamma-distributed values using the numpy.random.normal function:

In [None]:
# np.int was removed in NumPy 1.24; the built-in int is the equivalent
s_price = np.random.gamma(1.2, 150, 20000).astype(int) + np.random.normal(0.0, 6000, 20000).astype(int)

As per our set of the requirements for the simulated price, the price cannot be below 300 USD. Let's check what we got:

In [None]:
print('Minimum price: ' + str(np.amin(s_price).round(2)))

Now let's replace all prices below 300 USD with prices in the range [300 - 2500], since the initial price histogram clearly shows that most of the prices are distributed within this range.

In [None]:
s_price[s_price < 300] = np.random.uniform(300, 2500, len(s_price[s_price < 300]))
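The replacement step can be sketched on a tiny made-up array to show what happens under the hood. The boolean mask is computed once and reused, and because the target array holds integers, the uniform floats are truncated to whole numbers on assignment. The array values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny made-up integer price array (illustrative only)
prices = np.array([120, 950, -40, 4300, 299])

low = prices < 300                                 # boolean mask, computed once
prices[low] = rng.uniform(300, 2500, low.sum())    # floats are truncated to ints on assignment

print(prices, prices.dtype)
```

Values that were already at or above 300 are left untouched; only the masked positions are overwritten.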

Let's check the minimum price again - now it shouldn't be below 300 USD:

In [None]:
print('Minimum price: ' + str(np.amin(s_price).round(2)))

Time to visualize our simulated price:

In [None]:
sns.distplot((s_price), color="rebeccapurple", bins = 70)

If we don't take into consideration the huge spike of low prices in the initial Price histogram, we get quite a similar plot with smoother boundaries. Let's check the mean and the standard deviation of the simulated price:

In [None]:
sim_mean_p = np.mean(s_price).astype(int)
sim_std_p = np.std(s_price).astype(int)

print('Mean: ' + str(sim_mean_p))
print('Standard deviation: ' + str(sim_std_p))

If we compare the mean and standard deviation of the simulated price with those of the actual price (Mean = 3457, Std = 3808), we can see that there is only a small difference in these values.

Therefore we can assume that our simulated price is close enough to the actual data.
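One way to put a number on "close enough" is the relative difference in mean and standard deviation between the two samples. The sketch below uses a hypothetical helper, `summary_gap`, and stand-in data drawn from a gamma distribution rather than the real price column, so the printed numbers say nothing about the diamonds dataset itself.

```python
import numpy as np

def summary_gap(real, simulated):
    """Relative difference in mean and std between two samples (hypothetical helper)."""
    real = np.asarray(real, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return {
        "mean": abs(simulated.mean() - real.mean()) / abs(real.mean()),
        "std": abs(simulated.std() - real.std()) / real.std(),
    }

# Stand-in data: two independent draws from the same gamma distribution
rng = np.random.default_rng(0)
real_like = rng.gamma(2.0, 1000.0, 20_000)
sim_like = rng.gamma(2.0, 1000.0, 20_000)

gap = summary_gap(real_like, sim_like)
print({name: round(value, 4) for name, value in gap.items()})
```

Two distributions can share a mean and standard deviation and still differ in shape, so a check like this complements the histogram comparison rather than replacing it.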

## Length

Now we can move on to the next variable, diamond length. We have already simulated the Price variable, so let's see if there is any correlation between the length and the price. For this purpose we will use a matplotlib scatter plot.

In [None]:
plt.scatter(df['price'], df['length'], color="rebeccapurple")

From the plot above we can clearly see that price depends on the length. 

Let's try to fit a best fit line to this relationship. For this purpose we will be using `scipy.optimize.curve_fit`. As a straight line won't fit our correlation graph, we will fit a logarithmic curve (`y = a + b * log(x)`), which should suit our purpose much better than a straight line.

In [None]:
x_p = df['price']
y_l = df['length']

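# Import curve_fit explicitly; `import scipy` alone may not load scipy.optimize
from scipy.optimize import curve_fit

curve_fit(lambda t, a, b: a + b*np.log(t),  x_p,  y_l)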

The first array returned contains the **a** and **b** values (-2.36435806 and 1.03049703, respectively) that we are going to use to build our best fit line. Putting these values into the logarithmic curve formula y = a + b * log(x), the formula for the **y** values of the best fit line is: `y ≈ -2.36435806 + 1.03049703 * log(x)`
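Because `y = a + b * log(x)` is linear in `log(x)`, an ordinary degree-1 least-squares fit on `log(x)` should give essentially the same coefficients as the curve fit. The sketch below demonstrates this on synthetic data generated from a known curve. The values of `true_a` and `true_b` are illustrative assumptions, not the coefficients of the diamonds data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data generated from a known curve (illustrative values only)
true_a, true_b = -2.0, 1.0
x = rng.uniform(300, 18000, 5_000)
y = true_a + true_b * np.log(x) + rng.normal(0.0, 0.1, x.size)

# The model is linear in log(x), so a degree-1 polynomial fit recovers b and a
b_hat, a_hat = np.polyfit(np.log(x), y, 1)

print("a:", a_hat, "b:", b_hat)
```

Note that `np.polyfit` returns the highest-degree coefficient first, so the slope `b` comes before the intercept `a`.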

Now we can update our correlation plot with the best fit line.

In [None]:
plt.figure()

# Plot price versus length with magenta dots.
plt.plot(x_p, y_l, 'mo', label = "Original Data")

# Overlay the best fit line on the plot.
# provide the limits for x (min price & max price)
x = np.arange(326, 18806, 1)
plt.plot(x, -2.36435806 + 1.03049703 * np.log(x), 'k-', label=r"Best fit line")
plt.legend()
plt.show()

Based on the graph, the best fit line is good enough to simulate the diamond length values. We will use the formula of the best fit line and add some random noise from the **random.normal** distribution:

In [None]:
s_length = (-2.36435806 + 1.03049703 * np.log(s_price)) + np.random.normal(0.0, 0.7, s_price.size)

Let's check the minimum and maximum values we got in our simulated length:

In [None]:
print('Minimum length: ' + str(np.amin(s_length).round(2)))
print('Maximum length: ' + str(np.amax(s_length).round(2)))

If we check the min and max of our initial length, we can see that the range of the simulated length is wider.

In [None]:
df['length'].describe()

Therefore we will add a lower boundary by putting the following rule in place:
- if the length is lower than 3.76, replace it with a random number from the normal distribution (with mean = 4 and standard deviation = 0.1)

In [None]:
s_length[s_length < 3.76] = np.random.normal(4, 0.1, len(s_length[s_length < 3.76]))
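One caveat with a single replacement pass: 3.76 lies 2.4 standard deviations below the replacement mean of 4, so each redrawn value still has a bit under a 1% chance of landing below the boundary. If a strict lower bound is wanted, the redraw can be repeated until no value violates it. The helper `replace_below` below is a hypothetical sketch of that idea, run on stand-in data rather than the notebook's `s_length`.

```python
import numpy as np

def replace_below(values, lower, mean, sd, rng):
    """Redraw every value below `lower` from N(mean, sd) until none remain (hypothetical helper)."""
    out = np.array(values, dtype=float)    # work on a copy
    mask = out < lower
    while mask.any():
        out[mask] = rng.normal(mean, sd, mask.sum())
        mask = out < lower
    return out

rng = np.random.default_rng(3)
lengths = rng.normal(5.5, 1.0, 20_000)     # stand-in for s_length
bounded = replace_below(lengths, 3.76, 4.0, 0.1, rng)

print("Minimum length:", bounded.min())
```

Values that already satisfied the bound are left unchanged, so only the lower tail of the distribution is affected.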

Let's check the minimum length again:

In [None]:
print('Minimum length: ' + str(np.amin(s_length).round(2)))

Now it looks much better. Let's compare the mean and standard deviation of the initial length and the simulated one:

In [None]:
sim_mean_l = np.mean(s_length).round(2)
sim_std_l = np.std(s_length).round(2)

print('Mean of the simulated length: ' + str(sim_mean_l))
print('Standard deviation of the simulated length: ' + str(sim_std_l))

The initial mean is 5.51 and the standard deviation is 1.06. The simulated figures are close to the initial ones.

## Width

Let's take a look at the next variable - Diamond Width and its correlation with the Diamond Length using seaborn regplot.

In [None]:
sns.regplot(x='width', y='length', data=df, color="rebeccapurple")

From the graph above we can clearly see the positive linear correlation between length and width. From the way *the best fit line* fits the data, we can tell that the width is approximately equal to the length (the graph shows that y ≈ x). In this case we will simulate the width data by taking the simulated length and adding some small random numbers from the **random.normal** distribution.

In [None]:
s_width = s_length + np.random.normal(0.1, 0.05, s_length.size)
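As a quick sanity check of this construction, the sketch below builds a width from a stand-in length (a normal sample, not the notebook's `s_length`) with the same offset and noise level, and then measures the correlation and the fitted line. The stand-in mean and spread are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

length = rng.normal(5.5, 1.0, 20_000)                  # stand-in for s_length
width = length + rng.normal(0.1, 0.05, length.size)    # same offset and noise level as above

r = np.corrcoef(length, width)[0, 1]
slope, intercept = np.polyfit(length, width, 1)

print("correlation:", r)
print("slope:", slope, "intercept:", intercept)
```

Because the noise has mean 0.1 rather than 0, the simulated width is on average 0.1 mm larger than the length; this shows up as the intercept of the fitted line while the slope stays close to 1.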

Let's calculate the min, max, mean and standard deviation of the simulated width.

In [None]:
print('Minimum width: ' + str(np.amin(s_width).round(2)))
print('Maximum width: ' + str(np.amax(s_width).round(2)))

sim_mean_w = np.mean(s_width).round(2)
sim_std_w = np.std(s_width).round(2)

print('Mean of simulated width: ' + str(sim_mean_w))
print('Standard deviation of simulated width: ' + str(sim_std_w))

The min, max, mean and standard deviation are almost the same as for the initial width data:

In [None]:
df['width'].describe()

Let's plot two graphs showing correlation between the length and width in the initial dataset and in the simulated:

In [None]:
fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Length and width correlation", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Initial")
ax1.set_xlabel("Length")
ax1.set_ylabel("Width") 
sns.regplot(x='length', y='width', data=df, ax=ax1, color="rebeccapurple")

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Simulation")
ax2.set_xlabel("Length")
ax2.set_ylabel("Width") 
sns.regplot(x=s_length, y=s_width, ax=ax2, color="rebeccapurple")

Apart from two outliers in the initial dataset, both plots look almost the same.

## Depth

The next variable that we are going to look at and simulate is Diamond Depth. Let's check if there is any correlation between the depth and the length of the diamonds.

In [None]:
plt.scatter(df['depth'], df['length'], color="rebeccapurple")

The graph above also shows a positive linear correlation between the length and the depth. But in this case we see that y is not equal to x. Let's try to fit *a best fit line*. First we need to calculate the slope (**m**) and y-intercept (**c**):

In [None]:
l = df['length']
d = df['depth']

# First calculate the means of l and d.
l_avg = np.mean(l)
d_avg = np.mean(d)

# Subtract means from l and d.
l_zero = l - l_avg
d_zero = d - d_avg

# The best m is found by the following calculation.
m = np.sum(l_zero * d_zero) / np.sum(l_zero * l_zero)

# Use m from above to calculate the best c.
c = d_avg - m * l_avg

print("m is %8.6f and c is %6.6f." % (m, c))

We can calculate the slope and y-intercept using another method, the `np.polyfit` function. Let's check if we get the same numbers:

In [None]:
np.polyfit(l, d, 1)

As you can see the slope and y-intercept values are exactly the same. Now we are going to plot our line (`y = 0.61527133 * x + 0.01297997`):

In [None]:
# Plot l versus d with magenta dots.
plt.plot(l, d, 'm.', label="Data")

# Overlay some lines on the plot.
x = np.arange(3.5, 10.5, 1.0)
plt.plot(x, m * x + c, 'k-', label=r"Best fit line")


# Add a legend.
plt.legend()

# Add axis labels.
plt.xlabel('Length')
plt.ylabel('Depth')

# Show the plot.
plt.show()

Now we will simulate the depth data by using our best fit line formula plus some random noise using **random.normal** distribution.

In [None]:
s_depth = (0.61527133 * (s_length) + 0.01297997) + np.random.normal(0.0, 0.05, s_length.size)
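A simple way to check that this kind of simulation preserves the relationship is to re-fit a line to the simulated points: the fit should recover roughly the slope and intercept that went in, and the residual spread should match the noise level. The sketch below does this with the coefficients quoted above and a stand-in length sample (an assumption; it is not the notebook's `s_length`).

```python
import numpy as np

rng = np.random.default_rng(11)

m, c = 0.61527133, 0.01297997             # slope and intercept quoted above
length = rng.normal(5.5, 1.0, 20_000)     # stand-in for s_length
depth = m * length + c + rng.normal(0.0, 0.05, length.size)

# Re-fit a straight line to the simulated data
m_hat, c_hat = np.polyfit(length, depth, 1)
residual_sd = np.std(depth - (m_hat * length + c_hat))

print("slope:", m_hat, "intercept:", c_hat)
print("residual standard deviation:", residual_sd)
```

If the real data scatter more widely around the line than the simulated data, the noise standard deviation is the parameter to increase.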

Calculating min, max, mean and standard deviation for the simulated depth:

In [None]:
print('Minimum depth: ' + str(np.amin(s_depth).round(2)))
print('Maximum depth: ' + str(np.amax(s_depth).round(2)))

sim_mean_d = np.mean(s_depth).round(2)
sim_std_d = np.std(s_depth).round(2)

print('Mean of simulated depth: ' + str(sim_mean_d))
print('Standard deviation of simulated depth: ' + str(sim_std_d))

If we compare the values to the initial dataset (below), we will see that the difference is minimal.

In [None]:
df['depth'].describe()

Let's visualize the correlation between length and depth (initial vs simulated):

In [None]:
fig = plt.figure(figsize=(10,4))
title = fig.suptitle("Length and depth correlation", fontsize=14)
fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)
ax1.set_title("Initial")
ax1.set_xlabel("Length")
ax1.set_ylabel("Depth") 
sns.regplot(x='length', y='depth', data=df, ax=ax1, color="rebeccapurple")

ax2 = fig.add_subplot(1,2,2)
ax2.set_title("Simulation")
ax2.set_xlabel("Length")
ax2.set_ylabel("Depth") 
sns.regplot(x=s_length, y=s_depth, ax=ax2, color="rebeccapurple")

These two plots look very similar (excluding a few outliers in the Initial sub-plot).

## Carat

## distribution plots are not the same

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
sns.distplot(df['carat'], color="rebeccapurple", bins = 10)

In [None]:
df['carat'].describe()

Let's see how we can estimate the diamond weight (carats) from its measurements. There is a formula for the estimated diamond weight:
Estimated weight = Length * Width * Depth * Coefficient

The coefficient usually varies between 0.0057 and 0.0066 depending on the diamond shape. To create our simulated carat values, we will use the simulated length (s_length), simulated width (s_width) and simulated depth (s_depth). To get the coefficient values we will use the random.uniform distribution in the range [0.0057 - 0.0066].

In [None]:
# the coefficient is drawn from a random.uniform distribution in the range 0.0057 to 0.0066
s_carat = s_length * s_width * s_depth * np.random.uniform(0.0057, 0.0066, s_length.size)
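To get a feel for the numbers this formula produces, here is a small worked example for a single made-up stone. The helper `estimate_carat` and its default coefficient of 0.0061 (a value near the middle of the quoted range) are assumptions for illustration only.

```python
def estimate_carat(length_mm, width_mm, depth_mm, coefficient=0.0061):
    """Estimated weight = length * width * depth * coefficient (hypothetical helper)."""
    return length_mm * width_mm * depth_mm * coefficient

# A made-up stone measuring 6.5 x 6.5 x 4.0 mm
low = estimate_carat(6.5, 6.5, 4.0, 0.0057)    # lower end of the coefficient range
mid = estimate_carat(6.5, 6.5, 4.0)            # default coefficient
high = estimate_carat(6.5, 6.5, 4.0, 0.0066)   # upper end of the coefficient range

print(f"{low:.2f} to {high:.2f} carat (default coefficient: {mid:.2f})")
```

For a stone of this size the choice of coefficient alone moves the estimate by about 0.15 carat, which is the spread the uniform draw introduces into the simulated values.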

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
sns.distplot(s_carat, color="rebeccapurple", bins = 10)

In [None]:
print('Minimum carat: ' + str(np.amin(s_carat).round(2)))
print('Maximum carat: ' + str(np.amax(s_carat).round(2)))

sim_mean_c = np.mean(s_carat).round(2)
sim_std_c = np.std(s_carat).round(2)

print('Mean of simulated carat: ' + str(sim_mean_c))
print('Standard deviation of simulated carat: ' + str(sim_std_c))

Comparing the min, max, mean and standard deviation of the simulated carats with the initial measurements, we see that there is no big difference between the two sets of measurements.

# Testing dependencies

In [None]:
plt.scatter(df['carat'], df['width'], color="rebeccapurple")

In [None]:
plt.scatter(s_carat, s_width, color="rebeccapurple")

# Wrong, doesn't work as required

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
sns.distplot(df['carat'], color="rebeccapurple", bins = 100)

In [None]:
df['carat'].describe()

In [None]:
s_carat = np.random.poisson(1, 20000)/2 + np.random.normal(0.0, 0.1, s_length.size)

In [None]:
s_carat[s_carat<0.2] = np.random.uniform(0.25, 0.35, len(s_carat[s_carat<0.2]))

In [None]:
print('Minimum carat: ' + str(np.amin(s_carat).round(2)))
print('Maximum carat: ' + str(np.amax(s_carat).round(2)))

sim_mean_c = np.mean(s_carat).round(2)
sim_std_c = np.std(s_carat).round(2)

print('Mean of simulated carat: ' + str(sim_mean_c))
print('Standard deviation of simulated carat: ' + str(sim_std_c))

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
sns.distplot(s_carat, color="rebeccapurple", bins = 100)

# Carat done

In [None]:
plt.scatter(df['price'], df['carat'], color="rebeccapurple")

In [None]:
plt.scatter(s_price, s_carat, color="rebeccapurple")

# Stops here

In [None]:
plt.plot(s_length, s_depth, 'k.', label="Data")

In [None]:
plt.scatter(df['length'], df['depth'], color="rebeccapurple")

In [None]:
np.std(s_width).round(2)

In [None]:
x = np.random.normal(0.0, 0.1, s_length.size)

In [None]:
np.amin(x)

In [None]:
np.amax(x)

In [None]:
np.mean(x)

In [None]:
np.std(x)

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
sns.distplot(s_length, color="rebeccapurple", bins = 50)

## WIDTH AND DEPTH

In [None]:
w = df['width']
d = df['depth']

Finding the best fit line (calculating slope (m) and y-intercept (c))

In [None]:
# Calculate the best values for m and c.

# First calculate the means of w and d.
w_avg = np.mean(w)
d_avg = np.mean(d)

# Subtract means from w and d.
w_zero = w - w_avg
d_zero = d - d_avg

# The best m is found by the following calculation.
m = np.sum(w_zero * d_zero) / np.sum(w_zero * w_zero)
# Use m from above to calculate the best c.
c = d_avg - m * w_avg

print("m is %8.6f and c is %6.6f." % (m, c))

In [None]:
np.polyfit(w, d, 1)

In [None]:
#Plot w versus d with black dots.
plt.plot(w, d, 'k.', label="Data")

# Overlay some lines on the plot.
x = np.arange(3.0, 9.5, 1.0)
plt.plot(x, m * x + c, 'r-', label=r"Best fit line")


# Add a legend.
plt.legend()

# Add axis labels.
plt.xlabel('Width')
plt.ylabel('Depth')

# Show the plot.
plt.show()

In [None]:
sns.regplot(x="width", y="depth", data=df)

In [None]:
pois= np.random.exponential(0.5, s_price.size)

In [None]:
print(np.amin(pois))
print(np.amax(pois))

In [None]:
sns.distplot(pois, color="rebeccapurple", bins = 15)

In [None]:
sns.set_style("whitegrid")
#fig = plt.figure(figsize=(10,7))
#fig.add_subplot(1,1,1)
sns.distplot(df['carat'], color="rebeccapurple", bins = 50)

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot((s_length), color="rebeccapurple", bins = 10)

In [None]:
log_data = np.log(s_length)

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot((log_data), color="rebeccapurple", bins = 10)

In [None]:
test = np.random.gamma(3, 1, 20000)

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot((test), color="rebeccapurple")

In [None]:
log_data = np.log(data)

In [None]:
sim_mean_p = np.mean(s_price).astype(int)
sim_std_p = np.std(s_price).astype(int)

print('Mean: ' + str(sim_mean_p))
print('Standard deviation: ' + str(sim_std_p))

# Stops here

In [None]:
df['length'].describe()

In [None]:
df['width'].describe()

In [None]:
sns.set_style("whitegrid")
#fig = plt.figure(figsize=(10,7))
#fig.add_subplot(1,1,1)
sns.distplot(df['length'], color="rebeccapurple", bins = 10)

In [None]:
sns.set_style("whitegrid")
#fig = plt.figure(figsize=(10,7))
#fig.add_subplot(1,1,1)
sns.distplot(df['depth'], color="rebeccapurple", bins = 20)

If we check the mean, standard deviation and other parameters from .describe(), we can see that length and width have almost the same values and are therefore distributed almost identically.

Let's look at the correlation between these two attributes.

In [None]:
plt.scatter(df['length'], df['width'], color="rebeccapurple")

From the plot above we can see that there is an almost perfect linear correlation between length and width. Let's figure out the best fit line.

In [None]:
w = df['width']
l = df['length']

In [None]:
# Calculate the best values for m and c.

# First calculate the means of l and w.
l_avg = np.mean(l)
w_avg = np.mean(w)

# Subtract means from l and w.
l_zero = l - l_avg
w_zero = w - w_avg

# The best m is found by the following calculation.
m = np.sum(l_zero * w_zero) / np.sum(l_zero * l_zero)
# Use m from above to calculate the best c.
c = w_avg - m * l_avg

print("m is %8.6f and c is %6.6f." % (m, c))

Now we have the function y = 0.995319 * x + 0.036870

In [None]:
np.polyfit(l, w, 1)

The

In [None]:
np.polyfit(w, l, 1)

## PRICE AND LENGTH

In [None]:
plt.scatter(df['depth'], df['length'], color="rebeccapurple")

In [None]:
sns.regplot(x='length', y='width', data=df)

In [None]:
p = df['price']
l = df['length']

Finding the best fit line (calculating slope (m) and y-intercept (c))

In [None]:
# Calculate the best values for m and c.

# First calculate the means of p and l.
p_avg = np.mean(p)
l_avg = np.mean(l)

# Subtract means from p and l.
p_zero = p - p_avg
l_zero = l - l_avg

# The best m is found by the following calculation.
m_pl = np.sum(p_zero * l_zero) / np.sum(p_zero * p_zero)
# Use m from above to calculate the best c.
c_pl = l_avg - m_pl * p_avg

print("m is %8.6f and c is %6.6f." % (m_pl, c_pl))

In [None]:
np.polyfit(p, l, 2)

In [None]:
B, A = np.polyfit(p, np.log(l), 1) 

In [None]:
B

In [None]:
A

In [None]:
B*x

In [None]:
A*np.e

In [None]:
y = A*np.e**(B*x)

In [None]:
y

In [None]:
#-1.80750353e-08 * x**2 + 5.01767929e-04 * x +  4.25142800e+00

In [None]:
#Plot w versus d with black dots.
#plt.plot(p, l, 'k.', label="Data")

# Overlay some lines on the plot.
x = np.arange(0, 25000, 2500)
plt.plot(x, A*np.e**(B*x), 'r-', label=r"Best fit line")
#plt.plot(x, np.sqrt(x) - 140, 'r-', label=r"Best fit line")
#y = Ae^(Bx)

# Add a legend.
plt.legend()

# Add axis labels.
plt.xlabel('Price')
plt.ylabel('Length')

# Show the plot.
plt.show()

In [None]:
s_price

In [None]:
df['depth'].describe()

In [None]:
df['carat'].describe()

In [None]:
chi_sqr = np.random.noncentral_chisquare(1.6, 0.1, 20000)*1000

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot(chi_sqr)

In [None]:
sim_mean_p = np.mean(chi_sqr)
sim_std_p = np.std(chi_sqr)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p))

In [None]:
chi_sqr = np.random.noncentral_chisquare(3.8, .01, 10000)*1000
fig = plt.figure(figsize=(10,7))
sns.distplot(chi_sqr)

In [None]:
sim_mean_p = np.mean(chi_sqr)
sim_std_p = np.std(chi_sqr)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p))

In [None]:
df['price'].describe()

In [None]:
noise = np.random.normal(0.0, 200, chi_sqr.shape)

In [None]:
noise

In [None]:
gm = np.random.gamma(2.0, 2400, 10000)
fig = plt.figure(figsize=(10,7))
sns.distplot(gm)

# Testing gamma and chi square

In [None]:
chi_sqr = np.random.noncentral_chisquare(2.5, 0.00001, 20000)*1000

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot(chi_sqr)

In [None]:
sim_mean_p = np.mean(chi_sqr)
sim_std_p = np.std(chi_sqr)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p))

In [None]:
chi_sqr = np.random.noncentral_chisquare(3.8, 1, 10000)*1000 + np.random.normal(0.0, 500, 10000)

In [None]:
chi_sqr[chi_sqr<340] =  np.random.uniform(340, 18806 + 1, len(chi_sqr[chi_sqr<340]))

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot(chi_sqr)

In [None]:
sim_mean_p = np.mean(chi_sqr)
sim_std_p = np.std(chi_sqr)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p))

In [None]:
df['price'].describe()

In [None]:
# Use an explicit size for the noise: `gamma` still has 20000 elements from the earlier cell
gamma = np.random.gamma(2.0, 2300, 10000) + np.random.normal(0.0, 500, 10000)

In [None]:
gamma[gamma<340] = np.random.uniform(340, 18806 + 1, len(gamma[gamma<340]))

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot(gamma)

In [None]:
sim_mean_p1 = np.mean(gamma)
sim_std_p1 = np.std(gamma)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p1))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p1))

# end

In [None]:
df['price'].describe()

In [None]:
gamma = np.random.gamma(2.3, 2000, 10000) + np.random.normal(0.0, 200, 10000)
fig = plt.figure(figsize=(10,7))
sns.distplot(gamma)

In [None]:
sim_mean_p1 = np.mean(gamma)
sim_std_p1 = np.std(gamma)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p1))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p1))

In [None]:
df['carat'].describe()

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
sns.distplot(df['carat'], color="rebeccapurple", bins = 20, 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2})

In [None]:
df['depth'].describe()

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
sns.distplot(df['depth'], color="rebeccapurple")

In [None]:
df['length'].describe()

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
sns.distplot(df['length'], color="rebeccapurple")

In [None]:
df['width'].describe()

In [None]:
sns.set_style("whitegrid")
sns.distplot(df['width'], color="rebeccapurple")

In [None]:
plt.scatter(df['price'], df['carat'], color="rebeccapurple")

In [None]:
plt.scatter(df['width'], df['depth'], color="rebeccapurple")

In [None]:
plt.scatter(df['width'], df['length'], color="rebeccapurple")

In [None]:
#plt.axis([0, 6, 0, 20])
plt.scatter(df['length'], df['depth'], color="rebeccapurple")

In [None]:
plt.scatter(df['price'], df['length'], color="rebeccapurple")

In [None]:
plt.scatter(df['price'], df['width'], color="rebeccapurple")

In [None]:
plt.scatter(df['price'], df['depth'], color="rebeccapurple")

In [None]:
plt.scatter(df['price'], df['carat'], color="rebeccapurple")

In [None]:
plt.scatter(df['carat'], df['length'], color="rebeccapurple")

In [None]:
plt.scatter(df['carat'], df['width'], color="rebeccapurple")

In [None]:
plt.scatter(df['carat'], df['depth'], color="rebeccapurple")

In [None]:
#set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
sns.distplot(df['width'], color="rebeccapurple", bins = int(180/5), 
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 4})

In [None]:
plt.hist(df['width'], color = 'blue', edgecolor = 'black', bins = int(180/5))

In [None]:
sns.distplot(df['width'], bins = int(180/5))

# Testing cases

In [None]:
from scipy.stats import expon
from scipy.stats import poisson

In [None]:
data_expon = expon.rvs(scale=1,loc=0,size=1000)

fig = plt.figure(figsize=(10,7))
ax = sns.distplot(data_expon, bins=100)
ax.set(xlabel='Exponential Distribution', ylabel='Frequency')

In [None]:
g_shape, g_scale, g_size = 1, 2, 100000
gamma = np.random.gamma(g_shape, g_scale, g_size)
sns.distplot(gamma)

In [None]:
gamma = np.random.gamma(2.0, 2200, 10000) + np.random.normal(0.0, 500, 10000)

In [None]:
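# Draw one random integer per replaced value rather than a single shared one
gamma[gamma<340] = np.random.randint(340, 18806 + 1, size=(gamma<340).sum())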
#gamma

In [None]:
sim_mean_p1 = np.mean(gamma)
sim_std_p1 = np.std(gamma)

print('Mean of the simulated diamond price values: ' + str(sim_mean_p1))
print('Standard deviation of the simulated diamond price values: ' + str(sim_std_p1))

In [None]:
df['price'].describe()

In [None]:
#gammat = np.random.gamma(2.0, 2000, 10000)
fig = plt.figure(figsize=(10,7))
sns.distplot(gamma)

In [None]:
# Importing the full diamonds dataset from the local file
df = pd.read_csv("diamonds_full.csv")
df.head()

In [None]:
df1 = df.loc[df.loc[:, 'cut'] == 'Ideal']

In [None]:
df1.head()

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(10,7))
fig.add_subplot(1,1,1)
sns.distplot(df1['price'], color="rebeccapurple", bins = 100)

In [None]:
df1['price'].describe()

In [None]:
chi_sqr = np.random.noncentral_chisquare(2.5, 0.00001, 21551)*1000

In [None]:
fig = plt.figure(figsize=(10,7))
sns.distplot(chi_sqr)

In [None]:
np.polyfit(l, w, 2)

In [None]:
from pylab import *
from scipy.optimize import curve_fit

In [None]:
x = df['length']
y = df['price']

In [None]:
x = np.array(df['length'])
x

In [None]:
from pylab import *
from kapteyn import kmpfit

## References:

https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.optimize.curve_fit.html