[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PaHenriquez/448_Artificial_Intelligence/blob/main/Assignment3.ipynb)

## Imports

In [38]:
import pandas
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

## Part 1. Loading the dataset

In [39]:
# Using pandas load the dataset (load remotely, not locally)
# Output the first 15 rows of the data
# Display a summary of the table information (number of datapoints, etc.)

lab_data = pandas.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

lab_data.head(15)

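| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| 0 | 469 | 647 | 624474.3 |
| 1 | 403 | 694 | 577961.0 |
| 2 | 302 | 975 | 619684.7 |
| 3 | 779 | 916 | 1460449.0 |
| 4 | 901 | 18 | 43257.26 |
| 5 | 545 | 637 | 712463.4 |
| 6 | 660 | 519 | 700696.0 |
| 7 | 143 | 869 | 271826.0 |
| 8 | 89 | 461 | 89198.03 |
| 9 | 294 | 776 | 477021.0 |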


In [40]:
lab_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


## Part 2. Splitting the dataset

In [41]:
# Take the pandas dataset and split it into our features (X) and label (y)
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)

x = lab_data[['Temperature °C','Mols KCL']]
y = lab_data['Size nm^3']

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.10,random_state = 42)


## Part 3. Perform a Linear Regression

In [42]:
# Use sklearn to train a model on the training set

# Create a sample datapoint and predict the output of that sample with the trained model

# Report on the score for that model, in your own words (markdown, not code) explain what the score means

# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX

reg = LinearRegression()

reg.fit(x_train,y_train)

predictions = reg.predict(x_test)

actual_vs_predicted = pandas.DataFrame({'Actual': y_test,'Predicted':predictions})

actual_vs_predicted.head(5)


| | Actual | Predicted |
|---|---|---|
| 521 | 117762.3 | 235911.2 |
| 737 | 868729.3 | 830451.8 |
| 740 | 1084893.0 | 969401.7 |
| 660 | 1716039.0 | 1326783.0 |
| 411 | 953685.0 | 884987.0 |


In [43]:
print('Score = ',reg.score(x_test,y_test) )


Score =  0.8552472077276095


## Meaning of score

For linear regression, `score` returns the coefficient of determination, R². R² is the proportion of the variance in the label that the model explains: 1.0 means the predictions match the actual values exactly, and 0 means the model does no better than always predicting the mean of the label. A score of about 0.855 therefore means that, on the test set, the trained model explains roughly 86% of the variance in the size of the green blob from the amount of potassium chloride and the temperature. It is not a classification accuracy, so it does not mean that 86% of the predictions are correct.
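
To make the definition concrete, the sketch below computes R² by hand as 1 - SS_res / SS_tot and compares it with sklearn's `r2_score`. The five actual/predicted pairs are the rows shown in the comparison table above, so the resulting value describes only that small sample and will not equal the full test-set score.

```python
import numpy as np
from sklearn.metrics import r2_score

# The five actual/predicted pairs displayed in the comparison table
y_true = np.array([117762.3, 868729.3, 1084893.0, 1716039.0, 953685.0])
y_pred = np.array([235911.2, 830451.8, 969401.7, 1326783.0, 884987.0])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print("Manual R^2  =", r2_manual)
print("sklearn R^2 =", r2_score(y_true, y_pred))
```

Both values should agree, since `r2_score` uses the same formula that `LinearRegression.score` reports.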

In [44]:
temperature_coef = reg.coef_[0]
mols_coef = reg.coef_[1] 
intercept = reg.intercept_

print("Temperature coefficient = ", temperature_coef,"\n")
print("Mols coefficient = ", mols_coef, "\n")
print("Intercept = ", intercept)

Temperature coefficient =  866.1464133719206 

Mols coefficient =  1032.6950664857964 

Intercept =  -409391.4795834075


## Equation

$ h(x) = -409391.48 + 1032.70(\text{mols}) + 866.15(\text{temp}) $
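
The equation can be evaluated directly for a sample datapoint. The sketch below hard-codes the coefficients and intercept printed above; the function name `h` and the sample values (500 °C, 500 mols) are hypothetical choices for illustration, not values from the dataset.

```python
# Coefficients and intercept copied from the printed output of the fitted model
INTERCEPT = -409391.4795834075
TEMP_COEF = 866.1464133719206
MOLS_COEF = 1032.6950664857964


def h(temp_c, mols_kcl):
    """Linear hypothesis: intercept + temp_coef * temperature + mols_coef * mols."""
    return INTERCEPT + TEMP_COEF * temp_c + MOLS_COEF * mols_kcl


# Hypothetical sample datapoint (not taken from the dataset)
sample_temp, sample_mols = 500, 500
print("Predicted size (nm^3):", h(sample_temp, sample_mols))
```

With the fitted model in memory, `reg.predict(pandas.DataFrame({'Temperature °C': [500], 'Mols KCL': [500]}))` should return the same value, because `predict` applies this same equation.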

## Part 4. Use Cross Validation

In [45]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data

# Report on their finding and their significance


# k-fold for cross validation is defaulted at k = 5

results = cross_val_score(reg, x_train, y_train)

print("Cross validations results = ", results, "\n")

mean, standard_deviation = results.mean(), results.std()

print("Mean R^2 of %0.2f with a standard deviation of %0.2f" % (mean, standard_deviation))


Cross validations results =  [0.86226163 0.81982226 0.88938198 0.86663176 0.85729958] 

Mean R^2 of 0.86 with a standard deviation of 0.02


## Findings and significance on the use of cross validation

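Cross-validation is significant because it estimates how well the model will predict new data and makes overfitting to a single train/test split easier to detect. `cross_val_score` takes the k-fold approach: the dataset is split into k subsets (folds), and in each of k rounds the model is trained on k-1 folds and scored on the remaining fold, which acts as the validation set. The result is an array of k scores (R² here, the default for a regressor), and their mean is a more reliable estimate of performance than the score from a single split. The five scores range from about 0.82 to 0.89, with a mean of 0.86 and a standard deviation of 0.02, which agrees with the test-set score of 0.855 and suggests the linear model performs consistently across different splits of the data.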
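
The k-fold procedure can be written out as an explicit loop. The sketch below is a minimal illustration on synthetic stand-in data (an assumption, not the lab dataset): it trains on k-1 folds, scores on the held-out fold, and checks that the result matches `cross_val_score` with `cv=5`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the lab dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(200, 2))
y = 3.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 50, size=200)

model = LinearRegression()
# Five consecutive folds without shuffling, which is what cv=5 uses for a regressor
kfold = KFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    # Train on the k-1 training folds
    model.fit(X[train_idx], y[train_idx])
    # Score (R^2) on the single held-out validation fold
    manual_scores.append(model.score(X[val_idx], y[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print("Manual loop     :", manual_scores)
print("cross_val_score :", library_scores)
```

Because no shuffling happens by default, passing `cv=KFold(n_splits=5, shuffle=True, random_state=...)` is one way to score across shuffled splits instead.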

## Part 5. Using Polynomial Regression

In [46]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2

# Report on the metrics and output the resultant equation as you did in Part 3.


poly = PolynomialFeatures(degree = 2)

# Fit the polynomial expansion on the training set only, then apply the same transform to the test set
Xtrain = poly.fit_transform(x_train)
Xtest = poly.transform(x_test)
line_reg = LinearRegression()
line_reg.fit(Xtrain,y_train)

Y_prediction = line_reg.predict(Xtest)
data_comparison = pandas.DataFrame({'Actual':y_test, 'Predicted':Y_prediction})
data_comparison.head()


| | Actual | Predicted |
|---|---|---|
| 521 | 117762.3 | 117762.3 |
| 737 | 868729.3 | 868729.3 |
| 740 | 1084893.0 | 1084893.0 |
| 660 | 1716039.0 | 1716039.0 |
| 411 | 953685.0 | 953685.0 |


In [47]:
Scoring = line_reg.score(Xtest,y_test)

print("Score = ", Scoring, "\n")
print("Coefficients = ", line_reg.coef_, "\n")
print("Intercept = ", line_reg.intercept_)


Score =  1.0 

Coefficients =  [ 0.00000000e+00  1.20000000e+01 -1.27196671e-07  1.26476607e-11
  2.00000000e+00  2.85714287e-02] 

Intercept =  2.0479201339185238e-05


## Equation

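$ h(x) = 0.00 + 12.00a - 0.00b + 0.00a^2 + 2.00ab + 0.0286b^2 $

where $a$ is the temperature (°C) and $b$ is the mols of KCl. The terms follow the column order produced by `PolynomialFeatures`: $1, a, b, a^2, ab, b^2$.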

## Findings

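With polynomial regression of degree 2, the score (R²) on the test set is 1.0, and the predicted sizes match the actual sizes to the displayed precision. This implies that the size of the blob is, to within rounding, an exact quadratic function of the temperature ($a$) and the mols of KCl ($b$), so the model predicts it almost perfectly over the range of the data. The intercept and the coefficients for $b$ and $a^2$ are effectively zero, so only the $a$, $ab$, and $b^2$ terms contribute. For example, the first row of the dataset (469 °C, 647 mols) gives $12(469) + 2(469)(647) + 0.0286(647)^2 \approx 624474$, which matches the recorded size of 624474.3.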