**General questions**

* **Write down the steps to complete a regression analysis in a Markdown cell**

1. Define and design the sampling plan for the data to be analyzed.
    - e.g. sample size, number of explanatory variables
    - Decide what kind of relationship to look for between the variables (linear, as in y = mx + c, or non-linear).
    
2. Collect, prepare, and explore the dataset from the research or other available data.
3. Create a model, fit (train) it, inspect the coefficient values, and evaluate its accuracy.

Then judge the analysis performed.
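As a minimal illustration of these steps, the sketch below fits a straight line y = mx + c to toy data and then evaluates the fit. The data values and variable names are assumptions made up for this example; they are not the abalone data.

```python
import numpy

# Hypothetical toy data that follows y = 2x + 1 exactly (not the abalone data).
x_toy = numpy.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 * x_toy + 1.0

# Fit a degree-1 polynomial; polyfit returns the coefficients highest degree first.
m, c = numpy.polyfit(x_toy, y_toy, deg=1)

# Evaluate the fit with the root mean squared error (RMSE).
toy_predictions = m * x_toy + c
toy_rmse = numpy.sqrt(numpy.mean((y_toy - toy_predictions) ** 2))
print(m, c, toy_rmse)
```

Because the toy data is noise-free, the recovered slope and intercept should match the generating values up to floating-point error.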


# Dataset Source: http://archive.ics.uci.edu/ml/datasets/Abalone

### Data Set Information:

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

From the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).

Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

Name / Data Type / Measurement Unit / Description
-----------------------------
* Sex / nominal / -- / M, F, and I (infant)
* Length / continuous / mm / Longest shell measurement
* Diameter / continuous / mm / perpendicular to length
* Height / continuous / mm / with meat in shell
* Whole weight / continuous / grams / whole abalone
* Shucked weight / continuous / grams / weight of meat
* Viscera weight / continuous / grams / gut weight (after bleeding)
* Shell weight / continuous / grams / after being dried
* Rings / integer / -- / +1.5 gives the **age** in years
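
The last attribute says that age is obtained by adding 1.5 to the ring count. A short sketch of that conversion, using the ring counts of the first three rows of the dataset (the names `rings` and `age_years` are chosen only for this example):

```python
import pandas

# Ring counts of the first three rows of the dataset.
rings = pandas.Series([15, 7, 9], name="Rings")

# Age in years = Rings + 1.5, per the attribute description.
age_years = rings + 1.5
print(age_years.tolist())
```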


# Questions
* Read the dataset `abalone.data` into a dataframe named as dfR
* Write down all basic necessary steps
* Select X (independent variables 'all columns other than the **Rings** column') and Y (dependent variable the 'Rings' column)
* Split the data in training (70%) and testing (30%)
* while splitting the data in training and testing ensure that with every run same data is picked for training
* Select any two regression algorithms of your choice
* Train the algorithms in 3 frameworks (simple sklearn, TensorFlow, Spark)
* Calculate RMSE for all cases (6 cases= 2 Algo * 3 frameworks)
* Which algorithm has the best accuracy

## Start writing your solution from the following cell. 

In [0]:
# Read the dataset `abalone.data` into a dataframe named as dfR
import pandas
import numpy

In [0]:
dfR = pandas.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', sep=',', header=None)

In [11]:
dfR.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9


**Read the `abalone.data` file and set the column names by consulting the section "Name / Data Type / Measurement Unit / Description"**

In [0]:
column_names = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']


In [13]:
dfR = pandas.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',sep= ',', header= None, names=column_names)
dfR.columns


Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings'],
      dtype='object')

In [14]:
dfR.head() #or dfR.head(5)

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [15]:
dfR.Sex.value_counts()

M    1528
I    1342
F    1307
Name: Sex, dtype: int64

In [16]:
dfR.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


**Convert the column `Sex` to numbers**

In [0]:
dfR.replace({"Sex":{"I":1, "M":2, "F":3}}, inplace=True)
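
Mapping I, M and F to 1, 2 and 3 gives a nominal attribute an order and a spacing that a linear model will treat as numeric. One common alternative is one-hot encoding; the sketch below shows it with `pandas.get_dummies` on the `Sex` and `Rings` values of the first five rows. The frame `sample` is a stand-in for this example only, and this is not what the rest of the notebook uses.

```python
import pandas

# Sex and Rings values of the first five rows of the dataset.
sample = pandas.DataFrame({"Sex": ["M", "M", "F", "M", "I"],
                           "Rings": [15, 7, 9, 10, 7]})

# One indicator column per category replaces the single Sex column.
encoded = pandas.get_dummies(sample, columns=["Sex"])
print(encoded.columns.tolist())
```

With this encoding each category gets its own coefficient, at the cost of two extra columns.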

In [0]:
dfR.to_csv('corrected_csv', index = False, header=False)

In [19]:
corrected_csv_dataset = pandas.read_csv('corrected_csv',sep= ',', header= None, names=column_names)
corrected_csv_dataset.head(5)


Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,2,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,2,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,3,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,1,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [0]:
#corrected_csv_dataset.Rings.add(1.5)  # age in years = Rings + 1.5 (see the attribute list), not Rings * 1.5

In [21]:
dfR.info() # str/object ----> int64  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   int64  
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(2)
memory usage: 293.8 KB


In [22]:
dfR.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,2,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,2,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,3,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,2,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,1,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


# Questions
* Select X (independent variables 'all columns other than the **Rings** column') and Y (dependent variable the 'Rings' column)
* Split the data in training (70%) and testing (30%)
* while splitting the data in training and testing ensure that with every run same data is picked for training
* Select any two regression algorithms of your choice
* Train the algorithms in 3 frameworks (simple sklearn, TensorFlow, Spark)
* Calculate RMSE for all cases (6 cases= 2 Algo * 3 frameworks)
* Which algorithm has the best accuracy

In [0]:
x = dfR.iloc[:,[0,1,2,3,4,5,6,7]]
y = dfR.iloc[:,[8]]

In [24]:
x.head(3)

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight
0,2,0.455,0.365,0.095,0.514,0.2245,0.101,0.15
1,2,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07
2,3,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21


In [25]:
y.head(3)

Unnamed: 0,Rings
0,15
1,7
2,9


In [26]:
dfR.isna().sum()

Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64

In [0]:
# All values are already filled in.
# No missing values detected.

In [0]:
from sklearn.model_selection import train_test_split
# A fixed random_state makes the split pick the same rows on every run.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1029, test_size=0.3, train_size=0.7)

In [29]:
# x = independent variables
print(x.shape, '100%')
print(x_train.shape, '70%')
print(x_test.shape, '30%')
print()

# y = dependent variable
print(y.shape, '100%')
print(y_train.shape, '70%')
print(y_test.shape, '30%')



(4177, 8) 100%
(2923, 8) 70%
(1254, 8) 30%

(4177, 1) 100%
(2923, 1) 70%
(1254, 1) 30%
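
The requirement that every run picks the same training rows is met by passing a fixed `random_state`. A small sketch of that behaviour, assuming a made-up 10-row frame (`frame`) in place of the abalone data:

```python
import pandas
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the abalone features.
frame = pandas.DataFrame({"value": range(10)})

# Two independent calls with the same seed.
first_train, first_test = train_test_split(frame, test_size=0.3, random_state=1029)
second_train, second_test = train_test_split(frame, test_size=0.3, random_state=1029)

# Compare which rows landed in the training part of each call.
print(list(first_train.index) == list(second_train.index))
print(first_train.shape, first_test.shape)
```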


# Questions

* Select any two regression algorithms of your choice
* Train the algorithms in 3 frameworks (simple sklearn, TensorFlow, Spark)
* Calculate RMSE for all cases (6 cases= 2 Algo * 3 frameworks)
* Which algorithm has the best accuracy

# Linear Regression

In [0]:
#Linear Regression in SKLearn
from sklearn.linear_model import LinearRegression

In [31]:
my_lin_reg = LinearRegression()
my_lin_reg

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [32]:
my_lin_reg.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [33]:
print("intercept is : ", my_lin_reg.intercept_)
print("coefficients, in the order Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight : ", my_lin_reg.coef_)

intercept is :  [2.26296253]
coefficients, in the order Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight :  [[  0.34628732  -0.44297998   9.97320466  20.84386493   9.88898552
  -20.72111132 -10.93958359   6.19644913]]


In [0]:
predict_ring1 = my_lin_reg.predict(x_test)


In [35]:
from sklearn import metrics

# MSE, MAE:
print("MSE: ",metrics.mean_squared_error(y_test,predict_ring1))
print("MAE: ",metrics.mean_absolute_error(y_test,predict_ring1))


MSE:  4.992197448499818
MAE:  1.5989521245334795
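
The task asks for RMSE, while the cell above reports MSE and MAE. RMSE is the square root of MSE, so it can be derived from the printed value, as sketched below. In the notebook itself the equivalent would be `numpy.sqrt(metrics.mean_squared_error(y_test, predict_ring1))`; the names `reported_mse` and `derived_rmse` are used only here.

```python
import numpy

# MSE printed by the sklearn LinearRegression cell above.
reported_mse = 4.992197448499818

# RMSE is the square root of MSE and is in the same unit as Rings.
derived_rmse = numpy.sqrt(reported_mse)
print("RMSE: ", derived_rmse)
```

The result is roughly 2.23 rings.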


# Logistic Regression

In [0]:
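# Logistic Regression in SKLearn
# NOTE: despite its name, LogisticRegression is a classifier; each distinct Rings value is treated as a class.
from sklearn.linear_model import LogisticRegression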

In [37]:
my_logi_reg = LogisticRegression(random_state=1029)
my_logi_reg

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1029, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [38]:
my_logi_reg.fit(x_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1029, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
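# NOTE: predictions are made on x_train, so the accuracy below is a training-set score, not a test-set score.
predict_ring2= my_logi_reg.predict(x_train)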

In [0]:
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

In [41]:
accuracy_score(y_train, predict_ring2) 

0.2716387273349299

In [42]:
my_logi_reg.intercept_

array([-0.9837349 , -1.00207388,  3.51567113,  6.82301037,  6.48521136,
        5.94105138,  4.75357769,  2.71496468,  1.18019755,  0.45418549,
       -0.49966006, -0.8984485 , -0.32308919, -0.83062389, -1.24653047,
       -1.47498424, -1.74471581, -2.73654304, -2.87541371, -2.60940851,
       -2.67760154, -2.55464926, -3.23570634, -2.25505906, -1.69656811,
       -2.22305912])
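
Since `LogisticRegression` is a classifier, it yields an accuracy score rather than an RMSE. If a second regression algorithm is wanted for the comparison, scikit-learn offers several. The sketch below fits `Ridge` and `DecisionTreeRegressor` and computes a test-set RMSE for each. It runs on synthetic data from `make_regression` as a stand-in (an assumption for self-containment); the choice of these two estimators and their parameters is illustrative, not the notebook's own.

```python
import numpy
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 8 features, like the abalone predictors.
features, target = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=1029)
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.3, random_state=1029)

# Fit each regressor on the training part and score it on the held-out part.
results = {}
for name, regressor in [("ridge", Ridge(alpha=1.0)),
                        ("tree", DecisionTreeRegressor(random_state=1029))]:
    regressor.fit(f_train, t_train)
    mse = mean_squared_error(t_test, regressor.predict(f_test))
    results[name] = float(numpy.sqrt(mse))

print(results)
```

With the notebook's data, `x_train`, `x_test`, `y_train.values.ravel()` and `y_test` would take the place of the synthetic arrays.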

# Train the algorithms in 3 frameworks (simple sklearn-done above already, TensorFlow, Spark)

# TensorFlow

In [0]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [44]:
dfR.keys()

Index(['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight',
       'Viscera weight', 'Shell weight', 'Rings'],
      dtype='object')

In [45]:
%tensorflow_version 2.x

import tensorflow
from tensorflow import keras
import numpy
import matplotlib.pyplot as matPlotLibPyPlot
print(tensorflow.__version__)



2.2.0-rc2


In [0]:
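def build_model():
  # Two hidden layers of 64 ReLU units and a single linear output unit for regression.
  model = keras.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  # Train on mean squared error; also track MAE and MSE as metrics.
  optimizer = tf.keras.optimizers.RMSprop(0.001)
  model.compile(loss='mse', optimizer=optimizer, metrics=['mae', 'mse'])
  return model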

In [0]:
model = build_model()

In [48]:
model.fit(x_train, y_train, epochs =10)

Epoch 1/10


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f53bff8dd68>
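
The training cell stops at `model.fit`, so no test-set RMSE is reported for the Keras model. A model ending in `Dense(1)` predicts an array of shape (n, 1), and subtracting that from a 1-D label array would broadcast to an (n, n) matrix. The helper below is a minimal sketch that flattens both inputs first. The name `rmse` and the example arrays are hypothetical; in the notebook it would be called as `rmse(y_test, model.predict(x_test))`.

```python
import numpy

def rmse(y_true, y_pred):
    """Root mean squared error after flattening both inputs to 1-D."""
    # Flattening avoids accidental (n,) - (n, 1) broadcasting into an (n, n) matrix.
    y_true = numpy.asarray(y_true, dtype=float).ravel()
    y_pred = numpy.asarray(y_pred, dtype=float).ravel()
    return float(numpy.sqrt(numpy.mean((y_true - y_pred) ** 2)))

# Hypothetical values standing in for y_test and model.predict(x_test).
example_true = numpy.array([15, 7, 9])
example_pred = numpy.array([[14.0], [8.0], [9.0]])  # column shape, like a Dense(1) output
print(rmse(example_true, example_pred))
```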

# Spark

In [72]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u242-b08-0ubuntu3~18.04).
0 upgraded, 0 newly installed, 0 to remove and 25 not upgraded.


In [0]:
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session used by the cells below.
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

In [0]:
# ------------------> I don't need this because I am using my own version of the CSV file (in which Sex is already replaced by integers)
# from google.colab import files
# uploaded = files.upload()

In [0]:
df_dataset_spark = spark.read.format("csv").option("header","false").load("corrected_csv").toDF('Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings')

In [76]:
df_dataset_spark.head(3)

[Row(Sex='2', Length='0.455', Diameter='0.365', Height='0.095', Whole weight='0.514', Shucked weight='0.2245', Viscera weight='0.10099999999999999', Shell weight='0.15', Rings='15'),
 Row(Sex='2', Length='0.35', Diameter='0.265', Height='0.09', Whole weight='0.2255', Shucked weight='0.0995', Viscera weight='0.0485', Shell weight='0.07', Rings='7'),
 Row(Sex='3', Length='0.53', Diameter='0.42', Height='0.135', Whole weight='0.677', Shucked weight='0.2565', Viscera weight='0.1415', Shell weight='0.21', Rings='9')]

In [0]:
df_dataset_spark = df_dataset_spark.withColumn("Sex", df_dataset_spark["Sex"].cast("Integer"))
df_dataset_spark = df_dataset_spark.withColumn("Length", df_dataset_spark["Length"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Diameter", df_dataset_spark["Diameter"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Height", df_dataset_spark["Height"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Whole weight", df_dataset_spark["Whole weight"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Shucked weight", df_dataset_spark["Shucked weight"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Viscera weight", df_dataset_spark["Viscera weight"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Shell weight", df_dataset_spark["Shell weight"].cast("Float"))
df_dataset_spark = df_dataset_spark.withColumn("Rings", df_dataset_spark["Rings"].cast("Integer"))

In [78]:
df_dataset_spark.schema

StructType(List(StructField(Sex,IntegerType,true),StructField(Length,FloatType,true),StructField(Diameter,FloatType,true),StructField(Height,FloatType,true),StructField(Whole weight,FloatType,true),StructField(Shucked weight,FloatType,true),StructField(Viscera weight,FloatType,true),StructField(Shell weight,FloatType,true),StructField(Rings,IntegerType,true)))

In [79]:
df_dataset_spark.head(3)

[Row(Sex=2, Length=0.45500001311302185, Diameter=0.36500000953674316, Height=0.0949999988079071, Whole weight=0.5139999985694885, Shucked weight=0.22450000047683716, Viscera weight=0.10100000351667404, Shell weight=0.15000000596046448, Rings=15),
 Row(Sex=2, Length=0.3499999940395355, Diameter=0.26499998569488525, Height=0.09000000357627869, Whole weight=0.22550000250339508, Shucked weight=0.09950000047683716, Viscera weight=0.048500001430511475, Shell weight=0.07000000029802322, Rings=7),
 Row(Sex=3, Length=0.5299999713897705, Diameter=0.41999998688697815, Height=0.13500000536441803, Whole weight=0.6769999861717224, Shucked weight=0.2565000057220459, Viscera weight=0.14149999618530273, Shell weight=0.20999999344348907, Rings=9)]

In [0]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

In [81]:
assembler = VectorAssembler(inputCols=['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 
                                       'Shucked weight', 'Viscera weight', 'Shell weight'], outputCol = 'featuresVector')
final1 = assembler.transform(df_dataset_spark)
final1.show(5)

+---+------+--------+------+------------+--------------+--------------+------------+-----+--------------------+
|Sex|Length|Diameter|Height|Whole weight|Shucked weight|Viscera weight|Shell weight|Rings|      featuresVector|
+---+------+--------+------+------------+--------------+--------------+------------+-----+--------------------+
|  2| 0.455|   0.365| 0.095|       0.514|        0.2245|         0.101|        0.15|   15|[2.0,0.4550000131...|
|  2|  0.35|   0.265|  0.09|      0.2255|        0.0995|        0.0485|        0.07|    7|[2.0,0.3499999940...|
|  3|  0.53|    0.42| 0.135|       0.677|        0.2565|        0.1415|        0.21|    9|[3.0,0.5299999713...|
|  2|  0.44|   0.365| 0.125|       0.516|        0.2155|         0.114|       0.155|   10|[2.0,0.4399999976...|
|  1|  0.33|   0.255|  0.08|       0.205|        0.0895|        0.0395|       0.055|    7|[1.0,0.3300000131...|
+---+------+--------+------+------------+--------------+--------------+------------+-----+--------------

In [82]:
finalized_data = final1.select("featuresVector","Rings")
finalized_data.show(4)

+--------------------+-----+
|      featuresVector|Rings|
+--------------------+-----+
|[2.0,0.4550000131...|   15|
|[2.0,0.3499999940...|    7|
|[3.0,0.5299999713...|    9|
|[2.0,0.4399999976...|   10|
+--------------------+-----+
only showing top 4 rows



In [0]:
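# NOTE: no seed is passed here, so the rows in each split change from run to run;
# randomSplit also accepts a seed argument if a repeatable split is needed.
training_data,test_data = finalized_data.randomSplit([0.7, 0.3])

# Despite the "logi" in the variable names, this is Spark's LinearRegression estimator.
my_logi_reg3 = LinearRegression(featuresCol='featuresVector', labelCol='Rings')

# Fitting returns the trained linear regression model.
my_logi_reg4 = my_logi_reg3.fit(training_data)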


In [0]:
pred= my_logi_reg4.evaluate(test_data)

In [85]:
pred.predictions.show()

+--------------------+-----+------------------+
|      featuresVector|Rings|        prediction|
+--------------------+-----+------------------+
|[1.0,0.1099999994...|    3| 4.698414175456747|
|[1.0,0.1500000059...|    2| 4.670538893661115|
|[1.0,0.1599999964...|    3| 4.754730736683914|
|[1.0,0.1749999970...|    4| 5.101266798841985|
|[1.0,0.1850000023...|    4| 5.153785447567115|
|[1.0,0.1850000023...|    6|10.030812165526106|
|[1.0,0.1899999976...|    4| 5.355266612758289|
|[1.0,0.2000000029...|    5| 5.203253918696742|
|[1.0,0.2000000029...|    4| 5.516031757387704|
|[1.0,0.2000000029...|    4| 5.306100380124402|
|[1.0,0.2049999982...|    5|  5.27258248307767|
|[1.0,0.2049999982...|    4| 5.501051755410391|
|[1.0,0.2099999934...|    4| 5.428823223837269|
|[1.0,0.2099999934...|    4| 5.480331922262419|
|[1.0,0.2150000035...|    5| 5.318174500873329|
|[1.0,0.2150000035...|    3| 5.490554698665735|
|[1.0,0.2249999940...|    4| 5.590958916241868|
|[1.0,0.2249999940...|    4| 5.844779319

In [0]:
coefficient = my_logi_reg4.coefficients
intercept= my_logi_reg4.intercept

In [87]:
print ("Coefficients are:", coefficient)
print ("Intercept is:", intercept)

Coefficients are: [0.3798189161346952,-3.7361893985788366,14.726100931155132,9.435661822766622,9.092709566365494,-19.441399795736938,-9.40173201110121,7.763470880629463]
Intercept is: 3.0925319766022685


In [0]:
# Import RegressionEvaluator to report RMSE, MSE and MAE

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

In [0]:
t2 = RegressionEvaluator(labelCol= 'Rings', predictionCol='prediction', metricName='rmse')

In [91]:
print ("RMSE: ", t2.evaluate(pred.predictions))

RMSE:  2.1723952138291094


In [92]:
print (f"MSE: {t2.evaluate(pred.predictions, {t2.metricName:'mse'})}")

MSE: 4.719300965067621


In [93]:
print (f"MAE: {t2.evaluate(pred.predictions, {t2.metricName:'mae'})}")

MAE: 1.5840571056851338
