# Machine Learning Development - Data Collection/Exploration

In this module, we discuss how to develop a custom AI model for prediction by following the proper machine learning lifecycle.

To better understand how machine learning is applied to create solutions, we will use the concepts and knowledge from the past notebooks to build a custom machine learning model that predicts the price of a diamond based on its features.

## Problem and Resources

Our task in this scenario is to build a model that predicts the price of a diamond from its individual features. The dataset records the following attributes for each diamond: *(1) price, (2) carat, (3) cut, (4) color, (5) clarity, (6) x/length, (7) y/width, (8) z/depth, (9) total depth percentage, and (10) table*. Of these, price is the target to be predicted, and the remaining nine are the input features.

The model will be trained on a total of 53,940 diamond samples compiled in a single CSV dataset.
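To make the split between target and features concrete, here is a minimal sketch. It uses five rows copied from the dataset preview shown later in this notebook instead of the full file, and the variable names `sample`, `target`, and `features` are illustrative assumptions rather than names used elsewhere in the project.

```python
import io

import pandas as pd

# Five rows copied from the dataset preview; the real file has 53,940 rows.
sample_csv = """carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# `price` is the value to predict; every other column is an input feature.
target = sample["price"]
features = sample.drop(columns=["price"])

print(features.columns.tolist())
print(target.tolist())
```

Keeping the target in its own variable makes it harder to accidentally feed the price back into the model as an input.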

## Data Collection

We will skip this step for now because a suitable dataset is already publicly available on Kaggle. This dataset, named "Diamonds", contains 53,940 diamond samples whose features will be used to train our AI model.

Credit goes to its creator, Shivam Agrawal, for providing this [Kaggle dataset](https://www.kaggle.com/datasets/shivam2503/diamonds), which is free to use for aspiring machine learning engineers and developers for experimentation and learning.

## Data Exploration

### Introduction

To properly explore our data, we will use the following Python libraries to describe, modify, and visualize it. Links to their official documentation are included as an additional reference for learning how to take further advantage of their features for advanced exploration.
-  [Pandas Library](https://pandas.pydata.org/docs/) provides high-performance, easy-to-use data structures and data analysis tools.
-  [Matplotlib Library](https://matplotlib.org/stable/index.html) creates static, animated, and interactive visualizations.
-  [Seaborn Library](https://seaborn.pydata.org/) provides a high-level interface for drawing attractive and informative statistical graphics, built on top of Matplotlib.
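
As a rough illustration of how these libraries fit together, the sketch below holds a few values in a Pandas DataFrame and draws them with Matplotlib. The carat and price values are copied from the dataset preview later in this notebook; the choice of a scatter plot and the name `sample` are assumptions made for the example, not part of the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Carat and price values copied from the dataset preview in this notebook.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "price": [326, 326, 327, 334, 335],
})

# Plot price against carat to see how the two columns relate.
fig, ax = plt.subplots()
ax.scatter(sample["carat"], sample["price"])
ax.set_xlabel("carat")
ax.set_ylabel("price")
ax.set_title("Price vs. carat (5-row sample)")
```

In a notebook the figure renders below the cell. Seaborn's `scatterplot` function could likely be swapped in for a more polished default style.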

### Code

*The first step in data exploration is to fetch the dataset, which in this case is a CSV file, and load it into a Pandas DataFrame so that we can access, modify, and visualize it. Once it is loaded, we check the characteristics and properties of the DataFrame.*

In [3]:
# Import the Pandas library
import pandas as pd

# Fetch CSV dataset and convert it into a DataFrame
dataset_directory = 'Resources/diamonds.csv'
dataframe = pd.read_csv(dataset_directory)

In [4]:
# Check characteristics/properties of DataFrame
print("Data Type: " + str(type(dataframe)))
print("Dataset Size: " + str(len(dataframe)))
print("Dataset Shape (Rows, Columns): " + str(dataframe.shape))

Data Type: <class 'pandas.core.frame.DataFrame'>
Dataset Size: 53940
Dataset Shape (Rows, Columns): (53940, 11)


*Next, we preview the first few samples in the DataFrame and check the general information describing its data.*

In [5]:
# Preview the first 5 rows of the dataset
dataframe.head()

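|   | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |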
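
The preview above includes an `Unnamed: 0` column that simply counts rows from 1. Assuming it is a row index saved into the CSV and not a diamond feature, one possible cleanup is to drop it before further exploration. The sketch below rebuilds the five preview rows in memory so that it runs without the file; `sample` and `cleaned` are hypothetical names.

```python
import io

import pandas as pd

# The five preview rows, including the `Unnamed: 0` column from the CSV.
sample_csv = """Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Assumption: `Unnamed: 0` is only a saved row counter, not a feature.
cleaned = sample.drop(columns=["Unnamed: 0"])

print(cleaned.shape)
print(cleaned.columns.tolist())
```

An alternative with the same effect on the feature columns is to pass `index_col=0` to `pd.read_csv`, which uses that column as the DataFrame index.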


In [6]:
# Show the data type of each column
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB
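
The `info()` output shows that `cut`, `color`, and `clarity` are stored as text while the remaining columns are numeric. A natural follow-up, sketched here under the assumption that we want a quick summary of both kinds of column, is to count the categories with `value_counts()` and summarize the numbers with `describe()`. The sketch uses the five preview rows, not the full dataset, so the counts are illustrative only.

```python
import pandas as pd

# A subset of columns from the five preview rows shown earlier.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color": ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "price": [326, 326, 327, 334, 335],
})

# Categorical columns are the ones that are not stored as numbers.
categorical = sample.select_dtypes(exclude="number").columns.tolist()

# Count how often each category appears in every categorical column.
for column in categorical:
    print(sample[column].value_counts())

# Summary statistics (count, mean, min, max, ...) for the numeric columns.
print(sample.describe())
```

On the full DataFrame, the same calls would reveal how many distinct cuts, colors, and clarity grades exist, which matters later when these columns need to be encoded as numbers.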


[Diamond Price Prediction](https://github.com/RuralNative/Diamond_Price_Prediction)